KR20210003540A

KR20210003540A - Apparatus and method for embedding multi-vector document using semantic decomposition of complex documents

Info

Publication number: KR20210003540A
Application number: KR1020190079449A
Authority: KR
Inventors: 박종인; 김남규
Original assignee: 국민대학교산학협력단
Priority date: 2019-07-02
Filing date: 2019-07-02
Publication date: 2021-01-12
Also published as: KR102330190B1

Abstract

The present invention relates to an apparatus and a method for embedding a multi-vector document through semantic decomposition of complex documents which can correct vector distortion and accurately embed a document. The apparatus comprises: a target document parsing unit to separate all terms into tokens by parsing target documents included in a document set; a word embedding unit to convert each token into a word vector by word embedding; a keyword vector extraction unit to extract a word vector of a token designated as a keyword for each target document among the word vectors to generate a keyword vector set for each document; a keyword clustering unit to perform clustering analysis on the keyword vector set for each document to generate a plurality of keyword clusters; and a multi-vector generation unit to generate vectors for each keyword cluster based on keyword vectors included in the plurality of keyword clusters to determine the vectors for each keyword cluster as multiple vectors of a target document associated with the plurality of keyword clusters.

Description

Multi-vector document embedding device and method through semantic decomposition of compound documents {APPARATUS AND METHOD FOR EMBEDDING MULTI-VECTOR DOCUMENT USING SEMANTIC DECOMPOSITION OF COMPLEX DOCUMENTS}

The present invention relates to a multi-vector document embedding technique based on semantic decomposition of compound documents, and more particularly to an apparatus and method for multi-vector document embedding through semantic decomposition of compound documents that can correct vector distortion and embed documents more accurately.

The amount of data created, distributed, and used in everyday life is growing so quickly that the term "big data" is now used more often than "data". The number of interactions a user performs each day through the Internet, social network services, IoT, and similar channels is also expected to rise sharply. Because these interactions are carried mainly by text, the volume of text data in circulation can be expected to grow sharply as well. As text data takes up a larger share of the data ecosystem, attempts to create new knowledge through systematic management and varied analysis of text data are being pursued very actively.

Unlike structured data, to which various operations and traditional analysis techniques can be applied directly, unstructured text must first be structured, that is, converted from the original document into a form a computer can process. Mapping an arbitrary object into a space of a specific dimension while preserving its algebraic properties is called embedding. For text data, embedding is realized as word embedding, which represents words as vectors, and document embedding, which represents documents as vectors.

The most classical method for structuring words is one-hot encoding. This method creates a word vector space whose dimension equals the size of the word set, assigns an index to each word, and then sets the value of the dimension at the word's index to 1 and all other dimensions to 0. The method is easy to understand and implement, but it is inefficient because computational cost grows with the number of words, and it is limited because the word vector does not sufficiently reflect the meaning of the word.
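The one-hot scheme described above can be sketched in a few lines. This is a minimal illustration only; the function names `one_hot_encode` and `dot` and the three-word vocabulary are assumptions made for the example and do not come from the patent.

```python
def one_hot_encode(vocabulary):
    """Return a dict mapping each word to its one-hot vector."""
    size = len(vocabulary)
    vectors = {}
    for index, word in enumerate(vocabulary):
        vector = [0] * size      # one dimension per word in the vocabulary
        vector[index] = 1        # only the word's own index is set to 1
        vectors[word] = vector
    return vectors


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


vocabulary = ["document", "embedding", "vector"]  # hypothetical word set
encoded = one_hot_encode(vocabulary)
print(encoded["embedding"])                            # [0, 1, 0]
print(dot(encoded["document"], encoded["embedding"]))  # 0
```

The vector length equals the vocabulary size, and the dot product of any two distinct words is 0, which illustrates both limitations noted above: cost grows with the word count, and the vectors carry no information about similarity in meaning.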

Recently, attempts have been made from many angles to embed not only individual words but also sentences, paragraphs, and entire documents. In particular, as demand for analysis based on document embedding grows rapidly, many methods to support it have been proposed.

Korean Patent Registration No. 10-0490442 (2005.05.11)

An embodiment of the present invention is intended to provide an apparatus and method for multi-vector document embedding through semantic decomposition of compound documents that can correct vector distortion and embed documents more accurately.

An embodiment of the present invention is intended to provide an apparatus and method for multi-vector document embedding through semantic decomposition of compound documents that generates document vectors focused on the core terms of a document by actively reflecting prior knowledge about the document, namely the keyword information selected directly by the document's author, during document embedding.

An embodiment of the present invention is intended to provide an apparatus and method for multi-vector document embedding through semantic decomposition of compound documents that can perform more sophisticated text structuring and improve the quality of various text analysis results such as classification, clustering, and topic modeling.

Among the embodiments, an apparatus for multi-vector document embedding through semantic decomposition of compound documents includes: a target document parsing unit that separates all terms into tokens by parsing a target document included in a document set; a word embedding unit that converts each token into a word vector through word embedding; a keyword vector extraction unit that extracts, from the word vectors, the word vectors of tokens designated as keywords of each target document to generate a keyword vector set for each document; a keyword clustering unit that performs clustering analysis on the keyword vector set of each document to generate a plurality of keyword clusters; and a multi-vector generation unit that generates a vector for each keyword cluster based on the keyword vectors included in that cluster and determines these vectors as the multi-vector of the target document associated with the plurality of keyword clusters.

The target document parsing unit may perform the parsing after adding the keyword set of the target document to the word dictionary used for the parsing.

When there is no keyword set, the target document parsing unit may determine core words extracted through analysis of the target document as the keywords.

The word embedding unit may perform the word embedding using Word2Vec and generate, for each token, a vector of n real-valued dimensions as the word vector.

For a specific target document, the keyword vector extraction unit may generate a keyword vector set in which each element is a pair consisting of a keyword from the keyword set and the word vector for that keyword.

The keyword clustering unit may perform the clustering analysis by applying hierarchical or non-hierarchical clustering to the n-dimensional real values of the keyword vectors included in the keyword vector set.

The multi-vector generation unit may generate the vector for each keyword cluster by averaging the keyword vectors included in that cluster.

The multi-vector generation unit may generate a member vector set that includes the vector of each keyword cluster as a member vector, and determine that set as the multi-vector.

Among the embodiments, a method for multi-vector document embedding through semantic decomposition of compound documents includes: separating all terms into tokens by parsing a target document included in a document set; converting each token into a word vector through word embedding; extracting, from the word vectors, the word vectors of tokens designated as keywords of each target document to generate a keyword vector set for each document; performing clustering analysis on the keyword vector set of each document to generate a plurality of keyword clusters; and generating a vector for each keyword cluster based on the keyword vectors included in that cluster and determining these vectors as the multi-vector of the target document associated with the plurality of keyword clusters.

The disclosed technology may have the following effects. However, this does not mean that a specific embodiment must include all of the following effects or only the following effects, so the scope of rights of the disclosed technology should not be understood as being limited thereby.

The apparatus and method for multi-vector document embedding through semantic decomposition of compound documents according to an embodiment of the present invention can generate document vectors focused on the core terms of a document by actively reflecting prior knowledge about the document, namely the keyword information selected directly by the document's author, during document embedding.

The apparatus and method for multi-vector document embedding through semantic decomposition of compound documents according to an embodiment of the present invention can perform more sophisticated text structuring and improve the quality of various text analysis results such as classification, clustering, and topic modeling.

FIG. 1 is a diagram illustrating a multi-vector document embedding system through semantic decomposition of compound documents according to the present invention.
FIG. 2 is a block diagram illustrating the logical configuration of the multi-vector document embedding apparatus of FIG. 1.
FIG. 3 is a flowchart illustrating the multi-vector document embedding process performed by the multi-vector document embedding apparatus of FIG. 1.
FIG. 4 is a diagram illustrating the multi-vector document embedding process through semantic decomposition of compound documents according to the present invention.
FIG. 5 is an exemplary diagram illustrating the parsing process performed by the target document parsing unit of FIG. 2.
FIG. 6 is an exemplary diagram illustrating the word embedding process performed by the word embedding unit of FIG. 2.
FIG. 7 is an exemplary diagram illustrating the keyword vector set generation process performed by the keyword vector extraction unit of FIG. 2.
FIGS. 8 and 9 are exemplary diagrams illustrating the multi-vector generation process performed by the multi-vector generation unit of FIG. 2.
FIG. 10 is a diagram visualizing the vectors of FIG. 9 in a three-dimensional space.
FIGS. 11 to 16 are diagrams illustrating an embodiment of a performance experiment relating to the present invention.

The description of the present invention is merely a set of embodiments given for structural or functional explanation, so the scope of the present invention should not be construed as limited by the embodiments described herein. That is, because the embodiments can be modified in various ways and can take various forms, the scope of the present invention should be understood to include equivalents capable of realizing its technical idea. In addition, the objects or effects presented in the present invention do not mean that a specific embodiment must include all of them or only those effects, so the scope of the present invention should not be understood as being limited thereby.

Meanwhile, the terms used in the present application should be understood as follows.

Terms such as "first" and "second" are used to distinguish one component from another, and the scope of rights should not be limited by these terms. For example, a first component may be named a second component, and similarly a second component may be named a first component.

When a component is said to be "connected" to another component, it may be connected to that component directly, but it should be understood that other components may exist in between. In contrast, when a component is said to be "directly connected" to another component, it should be understood that no other component exists in between. Other expressions describing relationships between components, such as "between" and "directly between", or "adjacent to" and "directly adjacent to", should be interpreted in the same way.

Singular expressions should be understood to include plural expressions unless the context clearly indicates otherwise. Terms such as "include" or "have" are intended to specify that the stated features, numbers, steps, operations, components, parts, or combinations thereof exist, and should be understood not to exclude in advance the existence or possible addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

The identification codes of the steps (for example, a, b, c) are used for convenience of explanation and do not describe the order of the steps. Unless a specific order is clearly stated in context, the steps may occur in an order different from the one stated. That is, the steps may occur in the stated order, may be performed substantially simultaneously, or may be performed in the reverse order.

The present invention can be implemented as computer-readable code on a computer-readable recording medium, and the computer-readable recording medium includes all types of recording devices that store data readable by a computer system. Examples of computer-readable recording media include ROM, RAM, CD-ROM, magnetic tape, floppy disks, and optical data storage devices. The computer-readable recording medium may also be distributed over computer systems connected by a network, so that the computer-readable code is stored and executed in a distributed manner.

Unless otherwise defined, all terms used herein have the same meaning as commonly understood by a person of ordinary skill in the field to which the present invention belongs. Terms defined in commonly used dictionaries should be interpreted as having meanings consistent with their meaning in the context of the related art, and should not be interpreted as having ideal or excessively formal meanings unless explicitly defined in the present application.

FIG. 1 is a diagram illustrating a multi-vector document embedding system through semantic decomposition of compound documents according to the present invention.

Referring to FIG. 1, the multi-vector document embedding system 100 through semantic decomposition of compound documents may include a user terminal 110, a multi-vector document embedding apparatus 130, and a database 150.

The user terminal 110 may correspond to a computing device that can provide a specific document and check the multi-vector representation of that document. It may be implemented as a smartphone, a laptop, or a computer, but is not necessarily limited thereto and may also be implemented as various other devices such as a tablet PC. The user terminal 110 may be connected to the multi-vector document embedding apparatus 130 through a network, and a plurality of user terminals 110 may be connected to the multi-vector document embedding apparatus 130 at the same time.

The multi-vector document embedding apparatus 130 may be implemented as a server corresponding to a computer or program that can represent a document as multiple vectors through document embedding. The multi-vector document embedding apparatus 130 may be connected wirelessly to the user terminal 110 through Bluetooth, WiFi, a communication network, or the like, and may exchange data with the user terminal 110 through the network.

In one embodiment, the multi-vector document embedding apparatus 130 may work in conjunction with the database 150 to store information needed during document embedding. Unlike in FIG. 1, the multi-vector document embedding apparatus 130 may also be implemented with the database 150 included inside it. As its basic physical configuration, the multi-vector document embedding apparatus 130 may include a processor, a memory, a user input/output unit, and a network input/output unit.

The database 150 may correspond to a storage device that stores the various information needed during document embedding. The database 150 may store document sets on various topics that are the targets of document embedding, and may store the words extracted from documents and the word vectors derived through word embedding. It is not necessarily limited thereto and may store information collected or processed in various forms while the multi-vector document embedding apparatus 130 performs multi-vector document embedding through semantic decomposition of compound documents and generates the results.

FIG. 2 is a block diagram illustrating the logical configuration of the multi-vector document embedding apparatus of FIG. 1.

Referring to FIG. 2, the multi-vector document embedding apparatus 130 may include a target document parsing unit 210, a word embedding unit 220, a keyword vector extraction unit 230, a keyword clustering unit 240, a multi-vector generation unit 250, and a control unit 260.

The target document parsing unit 210 may separate all terms into tokens by parsing a target document included in the document set. That is, through parsing, the target document parsing unit 210 may split every term constituting the target document into token units. Here, a token may mean a morpheme, the smallest unit that carries meaning.

In one embodiment, the target document parsing unit 210 may perform parsing after adding the keyword set of the target document to the word dictionary used for parsing. The target document parsing unit 210 may perform structuring using the tokens produced by morphological analysis as they are, as in ordinary text analysis, or may perform structuring by referring to each document's keyword set during morphological analysis.

More specifically, one of the core operations of the multi-vector document embedding apparatus 130 is to represent as vectors those words in a document that correspond to keywords, and such keywords may include many compound nouns as well as simple nouns. Commonly used Korean morphological analyzers have the limitation that, apart from some idioms and proper nouns, they cannot extract compound nouns as single tokens. For example, even if a document's keywords include the term 'document embedding', the token set obtained through morphological analysis of the body may contain only 'document' and 'embedding', and not the term 'document embedding' itself.

Accordingly, the target document parsing unit 210 may add the keyword set to the word dictionary used for parsing in a preprocessing step before parsing the target document, and can thereby obtain a more accurate parsing result.
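The effect of adding keywords to the parser's dictionary can be approximated with a small sketch. The patent does not specify a particular morphological analyzer or dictionary API, so the code below is only a stand-in under stated assumptions: the input is an already tokenized list, and the hypothetical helper `merge_keywords` greedily rejoins consecutive tokens that spell out a multi-word keyword.

```python
def merge_keywords(tokens, keywords):
    """Greedily merge consecutive tokens that spell out a multi-word keyword."""
    keyword_tuples = {tuple(k.split()) for k in keywords}
    max_len = max((len(k) for k in keyword_tuples), default=1)
    merged, i = [], 0
    while i < len(tokens):
        # Try the longest candidate first so that longer keywords win.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = tuple(tokens[i:i + n])
            if n > 1 and candidate in keyword_tuples:
                merged.append(" ".join(candidate))
                i += n
                break
        else:
            # No multi-word keyword starts here; keep the single token.
            merged.append(tokens[i])
            i += 1
    return merged


# Hypothetical analyzer output and author keywords, for illustration only.
tokens = "multi vector document embedding corrects vector distortion".split()
keywords = ["document embedding", "vector distortion"]
print(merge_keywords(tokens, keywords))
# ['multi', 'vector', 'document embedding', 'corrects', 'vector distortion']
```

With the keyword set applied, 'document embedding' survives as one token instead of being split into 'document' and 'embedding', which is the outcome the dictionary step is meant to secure.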

In one embodiment, when there is no keyword set, the target document parsing unit 210 may determine core words extracted through analysis of the target document as the keywords. The target document parsing unit 210 may apply the keyword set while parsing the target document, and if no keyword set has been defined for the target document, it may generate one by analyzing the target document. To generate the keyword set, the target document parsing unit 210 may use any of various known analysis methods.
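The patent leaves the fallback keyword analysis open ("various known analysis methods"). As one possible stand-in, and purely as an assumption for illustration, the sketch below picks the most frequent non-stopword tokens. The function name `extract_keywords`, the `top_k` parameter, and the stopword list are all hypothetical.

```python
from collections import Counter


def extract_keywords(tokens, top_k=3, stopwords=frozenset()):
    """Return the top_k most frequent tokens that are not stopwords.

    Term frequency is an assumed stand-in; the source does not name a method.
    """
    counts = Counter(t for t in tokens if t not in stopwords)
    return [term for term, _ in counts.most_common(top_k)]


# Hypothetical token list for a document that has no author keywords.
tokens = ["embedding", "vector", "embedding", "document", "vector",
          "embedding", "the", "the", "the", "the"]
keywords = extract_keywords(tokens, top_k=2, stopwords=frozenset({"the"}))
print(keywords)  # ['embedding', 'vector']
```

Any other keyword extraction technique could be substituted here, as long as it yields a keyword set for the later steps.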

The word embedding unit 220 may convert each token into a word vector through word embedding. In one embodiment, the word embedding unit 220 may perform word embedding using Word2Vec and generate, for each token, a vector of n real-valued dimensions as the word vector. The word embedding unit 220 may apply various word embedding methods, but may convert tokens into vectors using the Word2Vec algorithm, which is known to reflect the features of words best among the word embedding methods currently in use.

Here, the Word2Vec algorithm may correspond to a method of representing words as vectors, and may be implemented in either the CBOW (Continuous Bag of Words) or the Skip-Gram manner. A word vector output by Word2Vec may have n real-valued dimensions, where n may correspond to the number of features each word can have. Accordingly, a word vector produced by the word embedding unit 220 can be projected into an n-dimensional space.
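The difference between the two Word2Vec variants lies in how training examples are built from a token sequence. The sketch below shows only that example construction, not the neural network training that produces the n-dimensional vectors. The function names and the `window` parameter are assumptions made for illustration.

```python
def skipgram_pairs(tokens, window=2):
    """Skip-Gram: each center word predicts each word in its context window."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


def cbow_examples(tokens, window=2):
    """CBOW: the surrounding context words jointly predict the center word."""
    examples = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + window + 1]
        if context:
            examples.append((context, target))
    return examples


tokens = ["multi", "vector", "document", "embedding"]  # hypothetical tokens
print(skipgram_pairs(tokens, window=1)[:3])
# [('multi', 'vector'), ('vector', 'multi'), ('vector', 'document')]
print(cbow_examples(tokens, window=1)[1])
# (['multi', 'document'], 'vector')
```

In practice an existing Word2Vec implementation would be used to train on such examples; the sketch is meant only to make the CBOW and Skip-Gram distinction concrete.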

The keyword vector extraction unit 230 may extract, from the word vectors, the word vectors of tokens designated as keywords of each target document and generate a keyword vector set for each document. That is, out of all the words in a document, the keyword vector extraction unit 230 may extract only the keywords of that document and generate a keyword vector set consisting only of the word vectors for those keywords. Because a keyword set can be defined independently for each document, a keyword vector set can also be defined independently for each document.

In one embodiment, for a specific target document, the keyword vector extraction unit 230 may generate a keyword vector set in which each element is a pair consisting of a keyword from the keyword set and the word vector for that keyword. The keyword vector set may be defined per document, and its elements are the pairs of each keyword belonging to the document's keyword set and that keyword's word vector.
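The per-document keyword vector set can be pictured as a list of (keyword, vector) pairs. The following is a minimal sketch under assumed data: the three-dimensional vectors, the document ID, and the helper name `build_keyword_vector_set` are hypothetical, and skipping keywords that have no word vector is a design choice of this example rather than something the patent states.

```python
def build_keyword_vector_set(word_vectors, keywords):
    """Return [(keyword, vector), ...] for keywords that have a word vector."""
    return [(kw, word_vectors[kw]) for kw in keywords if kw in word_vectors]


# Hypothetical word vectors for every token in the corpus (n = 3).
word_vectors = {
    "document embedding": [0.5, 0.25, 0.0],
    "clustering": [0.0, 0.75, 0.5],
    "the": [0.25, 0.25, 0.25],
}
# Hypothetical keyword sets, defined independently for each document.
doc_keywords = {"doc-1": ["document embedding", "clustering"]}

keyword_vector_sets = {
    doc_id: build_keyword_vector_set(word_vectors, kws)
    for doc_id, kws in doc_keywords.items()
}
print(keyword_vector_sets["doc-1"])
# [('document embedding', [0.5, 0.25, 0.0]), ('clustering', [0.0, 0.75, 0.5])]
```

Non-keyword tokens such as 'the' have word vectors but never enter the keyword vector set.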

The keyword clustering unit 240 may perform clustering analysis on the keyword vector set of each document to generate a plurality of keyword clusters. Here, clustering analysis may correspond to a statistical analysis method that measures the similarity between objects, groups highly similar objects together, and identifies both the similarity of objects within a cluster and the dissimilarity between objects in different clusters. That is, through clustering analysis the keyword clustering unit 240 can group together keywords that share a similar topic.

In one embodiment, the keyword clustering unit 240 may perform the clustering analysis by applying hierarchical or non-hierarchical clustering to the n-dimensional real values of the keyword vectors included in the keyword vector set. Clustering analysis can be classified by how the objects are analyzed; more specifically, based on whether cluster members may overlap and on the size of the data, it can be divided into hierarchical clustering methods and non-hierarchical clustering methods.
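As an illustration of the hierarchical option, the sketch below implements a small agglomerative clustering over keyword vectors. The patent does not fix a linkage criterion, a distance measure, or a stopping rule, so single linkage, Euclidean distance, and a target cluster count (`num_clusters`) are assumptions chosen only for this example.

```python
import math


def agglomerative_clusters(vectors, num_clusters):
    """Merge the two closest clusters (single linkage) until num_clusters remain.

    Returns a list of clusters, each a list of indices into `vectors`.
    """
    num_clusters = max(1, num_clusters)
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > num_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = min(math.dist(vectors[i], vectors[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # b > a, so index a stays valid
    return clusters


# Hypothetical 2-dimensional keyword vectors forming two obvious groups.
keyword_vectors = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
print(agglomerative_clusters(keyword_vectors, num_clusters=2))
# [[0, 1], [2, 3]]
```

A non-hierarchical method such as k-means could be used instead; either way the output is a partition of one document's keywords into clusters.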

The multi-vector generation unit 250 may generate a vector for each keyword cluster based on the keyword vectors included in that cluster, and may determine these cluster vectors to be the multiple vectors of the target document associated with the keyword clusters. That is, a keyword set is defined for one target document, and that keyword set is divided into a plurality of keyword clusters by clustering analysis. A vector is defined for each of the resulting keyword clusters, so that a plurality of vectors is defined for a single target document.

In one embodiment, the multi-vector generation unit 250 may generate the vector of each keyword cluster by averaging the keyword vectors included in that cluster. The multi-vector generation unit 250 may generate the vector of a keyword cluster from the n-dimensional real values of the word vectors corresponding to its keywords, by applying an arithmetic operation to the real values that the word vectors hold in the same dimension.
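As a non-limiting illustration, the dimension-wise averaging described above might be sketched as follows. The function name `cluster_vector` and the three-dimensional toy values are assumptions for illustration and do not appear in the disclosure.

```python
def cluster_vector(keyword_vectors):
    """Dimension-wise arithmetic mean of a non-empty list of equally sized vectors."""
    count = len(keyword_vectors)
    # zip(*...) groups the values that share a dimension across all vectors.
    return [sum(dim_values) / count for dim_values in zip(*keyword_vectors)]

# Two hypothetical keyword vectors that fell into the same cluster.
cluster = [[0.2, -0.4, 0.6], [0.4, 0.0, 0.2]]
member_vector = cluster_vector(cluster)  # approximately [0.3, -0.2, 0.4]
```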

In one embodiment, the multi-vector generation unit 250 may generate a member vector set that includes the vector of each keyword cluster as a member vector, and may determine this set to be the multi-vector. Here, a member vector corresponds to one of the vectors constituting the multi-vector of a target document, and is generated from the keywords of that document that were clustered together. As a result, the multi-vector generation unit 250 may generate a multi-vector representing a specific target document, and the multi-vector may be composed of one or more member vectors.

The control unit 260 controls the overall operation of the multi-vector document embedding apparatus 130, and may manage the control flow or data flow among the target document parsing unit 210, the word embedding unit 220, the keyword vector extraction unit 230, the keyword clustering unit 240, and the multi-vector generation unit 250.

FIG. 3 is a flowchart illustrating the multi-vector document embedding process performed by the multi-vector document embedding apparatus of FIG. 1.

Referring to FIG. 3, the multi-vector document embedding apparatus 130 may, through the target document parsing unit 210, parse the target documents included in a document set and separate all terms into tokens (step S310). The multi-vector document embedding apparatus 130 may, through the word embedding unit 220, convert each token into a word vector by word embedding (step S330). The multi-vector document embedding apparatus 130 may, through the keyword vector extraction unit 230, extract from the word vectors those of the tokens designated as keywords of each target document, thereby generating a keyword vector set for each document (step S350).

In addition, the multi-vector document embedding apparatus 130 may, through the keyword clustering unit 240, perform clustering analysis on the keyword vector set of each document to generate a plurality of keyword clusters (step S370). The multi-vector document embedding apparatus 130 may, through the multi-vector generation unit 250, generate a vector for each keyword cluster based on the keyword vectors included in that cluster, and determine these vectors to be the multi-vector of the target document associated with the keyword clusters (step S390).

FIG. 4 is a diagram illustrating the multi-vector document embedding process through semantic decomposition of a compound document according to the present invention.

Referring to FIG. 4, the multi-vector document embedding apparatus 130 may operate on documents in which the body text and the keywords are explicitly separated. For documents that do not provide keywords explicitly, a preprocessing step may be performed in which core words are derived from the body text by any of various analysis methods and defined as the keywords.

The multi-vector document embedding apparatus 130 may be composed of five main modules: (1) parsing, (2) word embedding, (3) keyword vector derivation, (4) keyword clustering, and (5) multi-vector generation. In the parsing step, all terms are separated into tokens through morphological analysis of the target document. Next, each token is converted by word embedding into a vector of N real values, and from these vectors only the word vectors of the tokens designated as keywords of each document are extracted to form the keyword vector set of that document. Next, clustering analysis is performed on the keyword set of each document to identify the multiple topics it contains. Finally, a vector for each cluster is derived from the vectors of the keywords that make up the cluster.

FIG. 5 is an exemplary diagram illustrating the parsing process performed by the target document parsing unit of FIG. 2.

Referring to FIG. 5, the multi-vector document embedding apparatus 130 may, through the target document parsing unit 210, parse the target documents included in the document set and separate all terms into tokens.

Panel (a) shows the keyword set of each document, and panel (b) shows the result of ordinary morphological analysis that does not take the keyword sets into account. Panel (c) shows the result of morphological analysis that takes into account the compound nouns included in the keyword sets. Terms shown in panel (c) such as "joint stabilization", "long distance running", and "interval training" are compound nouns that are not listed in the general-purpose dictionaries normally used for morphological analysis. To handle such terms as single tokens, they therefore need to be added to the dictionary at the morphological analysis stage.
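As a non-limiting illustration, the effect of registering compound keywords before tokenization might be sketched as follows. This is only a stand-in: it merges adjacent, already separated words by greedy longest match, whereas the experiment described later uses a morphological analyzer with an extended dictionary. The function name and the sample words are assumptions for illustration.

```python
def tokenize_with_compounds(words, compound_keywords):
    """Merge adjacent words that form a registered compound keyword (longest match first)."""
    compounds = {tuple(k.split()) for k in compound_keywords}
    max_len = max((len(c) for c in compounds), default=1)
    tokens, i = [], 0
    while i < len(words):
        # Try the longest candidate span first, down to two words.
        for n in range(min(max_len, len(words) - i), 1, -1):
            if tuple(words[i:i + n]) in compounds:
                tokens.append(" ".join(words[i:i + n]))
                i += n
                break
        else:
            # No compound starts here, so keep the single word as a token.
            tokens.append(words[i])
            i += 1
    return tokens

words = ["interval", "training", "improves", "long", "distance", "running"]
tokens = tokenize_with_compounds(words, ["interval training", "long distance running"])
# tokens == ["interval training", "improves", "long distance running"]
```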

FIG. 6 is an exemplary diagram illustrating the word embedding process performed by the word embedding unit of FIG. 2.

Referring to FIG. 6, the multi-vector document embedding apparatus 130 may, through the word embedding unit 220, structure each token as a vector of real values. In particular, the word embedding unit 220 may convert tokens into vectors using the word2Vec algorithm, which is known to reflect the features of words best among the word embedding methods currently in use. FIG. 6 shows an example in which each token of the token set has been converted into an n-dimensional vector.

FIG. 7 is an exemplary diagram illustrating the process of generating a keyword vector set performed by the keyword vector extraction unit of FIG. 2.

Referring to FIG. 7, the multi-vector document embedding apparatus 130 may, through the keyword vector extraction unit 230, extract from the word vectors generated by the word embedding unit 220 those of the tokens designated as keywords of each target document, thereby generating a keyword vector set for each document.

For example, the word vectors shown in panel (b) of FIG. 6 include the vectors of surrounding words as well as those of keywords. The keyword vector derivation performed by the keyword vector extraction unit 230 extracts, from these word vectors, only the vectors of terms specified as keywords of each document, and an example is shown in FIG. 7. Tokens such as 'use', 'exercise', 'development', and 'verification' in panel (b) of FIG. 6 are not included in the keyword set of panel (a) of FIG. 7, and therefore do not appear in the keyword vector set of panel (b) of FIG. 7. The multi-vector document embedding apparatus 130 may use only the keyword vectors shown in panel (b) of FIG. 7 for document embedding.
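As a non-limiting illustration, the extraction of per-document keyword vector sets might be sketched as follows. The function name, the document identifier `Doc1`, and the three-dimensional toy values are assumptions for illustration. Real vectors would come from the word embedding step.

```python
def build_keyword_vector_sets(word_vectors, doc_keywords):
    """Keep, per document, only the vectors of tokens designated as that document's keywords."""
    return {
        doc_id: {kw: word_vectors[kw] for kw in keywords if kw in word_vectors}
        for doc_id, keywords in doc_keywords.items()
    }

# Made-up word vectors covering both keywords and surrounding words.
word_vectors = {
    "application": [0.12, -0.30, 0.05],
    "gyro sensor": [0.10, -0.25, 0.07],
    "use": [0.01, 0.02, 0.03],
    "exercise": [-0.20, 0.11, 0.40],
}
doc_keywords = {"Doc1": ["application", "gyro sensor"]}

keyword_vector_sets = build_keyword_vector_sets(word_vectors, doc_keywords)
# 'use' and 'exercise' are not keywords of Doc1, so their vectors are dropped.
```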

FIGS. 8 and 9 are exemplary diagrams illustrating the multi-vector generation process performed by the multi-vector generation unit of FIG. 2.

The multi-vector document embedding apparatus 130 may, through the keyword clustering unit 240, perform clustering analysis on the keyword vector set of each document to generate a plurality of keyword clusters, and may, through the multi-vector generation unit 250, generate a vector for each keyword cluster based on the keyword vectors included in that cluster.

For a document that combines several topics, it is very hard to expect the single vector produced by embedding to represent each of those topics accurately. For example, document Doc₁ shown in FIG. 7 includes 'application' and 'gyro sensor', which are keywords of the 'IT' topic, and at the same time includes 'joint stabilization', 'sense of balance', 'reflexes', and 'biomechanics', which are keywords of the 'Medical' topic. If a document spanning all of these keywords is expressed as one vector, for example by using the average of all its keyword vectors as the document vector, that vector may be mapped to a location far from both the 'IT' topic and the 'Medical' topic. The same phenomenon occurs in other documents, and as a result most documents tend to be mapped into similar regions of the space even though the detailed topics they contain differ.

The multi-vector document embedding apparatus 130 may represent each document as a plurality of vectors, according to the number of topics that make up the document. For example, if the keyword set of Doc₁ is separated into two groups, the multi-vector document embedding apparatus 130 may represent that document with two member vectors.

The multi-vector document embedding apparatus 130 may cluster the keyword vectors of panel (b) of FIG. 7 document by document, separating the keyword set of each document into two groups. FIG. 8 shows an example of the member vectors derived for each document in this way. The topic names in FIG. 8, such as 'IT', 'Medical', and 'Sports', were inserted arbitrarily for convenience of explanation. For example, if Doc₁ were expressed as a single vector, using the overall average of the keywords constituting Doc₁, it would be expressed as Doc₁ = (-0.0058, -0.0084, …, 0.0497).

The multi-vector document embedding apparatus 130, however, expresses the same document with two member vectors, Doc₁¹ = (0.0179, -0.0702, …, 0.0177) and Doc₁² = (-0.0177, 0.0225, …, 0.0657). FIG. 9 compares expressing the three documents of FIG. 8 as single vectors with expressing them as multi-vectors. FIG. 9 shows that relationships among the three documents that are not evident from the single vectors become visible in detail in the multi-vector representation.

FIG. 10 visualizes the vectors of FIG. 9 in a three-dimensional space so that this can be confirmed intuitively.

In FIG. 10, the vectors are visualized in the three-dimensional space (D₁, D₂, D_N), with each document shown as a square and each keyword contained in a document shown as a triangle. Panel (a) of FIG. 10 is an example in which each document is expressed as one vector. The three documents are mapped to locations close to one another even though the topics they contain differ. Panel (b) of FIG. 10 is an example in which the member vectors constituting each document are additionally shown as circles. For example, Doc₁ appears as one vector in the middle of the space in the single-vector representation, but in the multi-vector representation it appears as two vectors, Doc₁¹ at the upper left and Doc₁² at the bottom. This multi-vector representation reveals relationships among the three documents that could not be clearly identified in the single-vector representation: the similarity between Doc₁¹ and Doc₃¹, between Doc₁² and Doc₂¹, and between Doc₂² and Doc₃².

4. Experiment

4.1. Experiment overview

This chapter summarizes the procedure and results of the experiment conducted to evaluate the usefulness of the present invention. The present invention proposes a new embedding scheme that represents a document as multiple vectors. However, there is no direct criterion for judging which of several embedding methods yields vectors that represent the original documents more accurately. This chapter therefore evaluates the quality of the document vectors derived by the various embedding methodologies indirectly, in terms of how well the embedding results can be used.

Specifically, documents whose categories are known are embedded in several different ways, and for each document the documents judged to be similar to it are identified. If the similar documents identified in this way belong to the same category as the reference document, the embedding is judged to be accurate. If a document belonging to a different category from the reference document was judged to be similar, the embedding is judged to be inaccurate.

The experimental data therefore must specify, for each document, both its keyword list and the category to which it belongs. As data satisfying these conditions, 3,147 papers covering a total of seven subject categories were collected from the Korean Studies Information Service System (KISS) site. The experiment as a whole was carried out in Python 3.6, using mainly Komoran for tokenization, Gensim for word2Vec modeling, and the Numpy package for vector operations. This section introduces the overall outline of the experiment, and the subsequent sections of this chapter present its main steps and results. The overall outline of the experiment is shown in FIG. 11.

The overall experimental process of FIG. 11 is divided into three phases. Phase 1 generates the compound documents used in the experiment. The need for generating compound documents and the procedure for doing so are covered in Section 4.2. Phase 2 generates multi-vectors according to the present invention, as described in detail above. Phase 3, the final phase, embeds the same document set using several methodologies. Document embedding is performed in a total of five ways, including the present invention, and the details are covered in Section 4.3. The performance evaluation results for the five embedding methods are presented in Section 4.4, the last section of this chapter.

4.2. Compound document generation

The present invention addresses a limitation of existing document embedding methods, which represent a compound document containing several topics as a single vector. Most documents in circulation in real life can be regarded as compound documents in which one document contains several topics, and the advantage of the present invention is expected to appear more clearly when embedding documents that contain several topics. To compare more clearly the performance differences among document embedding methodologies when representing compound documents, compound documents were generated artificially in this experiment using the category information of each document. The specific procedure is as follows.

The documents used in the experiment are the abstracts of 3,147 papers belonging to seven categories: 'Humanities', 'Social Science', 'Natural Science', 'Engineering', 'Agriculture', 'Medicine and Pharmacy', and 'Arts and Physical Education'. From seven categories, 21 distinct two-category combinations (for example, 'Natural Science + Engineering') can be formed, and 50 compound documents are generated for each combination. For example, the abstract of one paper in 'Natural Science' and the abstract of one paper in 'Engineering' can be merged to generate a compound document in the 'Natural Science + Engineering' field (FIG. 12).
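As a non-limiting illustration, the counting of category combinations and the merging of two abstracts might be sketched as follows. The dictionary layout of a document and the function name `merge_documents` are assumptions for illustration. The disclosure states only that two abstracts are merged.

```python
from itertools import combinations

CATEGORIES = [
    "Humanities", "Social Science", "Natural Science", "Engineering",
    "Agriculture", "Medicine and Pharmacy", "Arts and Physical Education",
]
DOCS_PER_PAIR = 50

# Every unordered pair of distinct categories: C(7, 2) = 21.
category_pairs = list(combinations(CATEGORIES, 2))
total_compound_docs = len(category_pairs) * DOCS_PER_PAIR  # 21 * 50 = 1050

def merge_documents(doc_a, doc_b):
    """Merge two single-category documents into one compound document (assumed layout)."""
    return {
        "text": doc_a["text"] + " " + doc_b["text"],
        "keywords": doc_a["keywords"] + doc_b["keywords"],
        "categories": (doc_a["category"], doc_b["category"]),
    }

compound = merge_documents(
    {"text": "abstract A", "keywords": ["k1"], "category": "Natural Science"},
    {"text": "abstract B", "keywords": ["k2"], "category": "Engineering"},
)
```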

Two ways of selecting, for each category, the documents to be used in generating compound documents were applied, and the performance under each was tested. The first is random sampling: for each category, 50 documents to take part in the combinations were selected at random. The second selects representative documents based on the center of each category. Specifically, the average of the keyword vectors of the documents in each category was computed and taken as the center of that category, and the 50 documents of the category closest to that center were then selected as its representative documents and used to generate compound documents.
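As a non-limiting illustration, the center-based selection of representative documents might be sketched as follows. The disclosure does not name the distance measure, so Euclidean distance is an assumption here, as are the function names, the document identifiers, and the two-dimensional toy vectors.

```python
import math

def mean_vector(vectors):
    """Dimension-wise mean of a non-empty list of equally sized vectors."""
    return [sum(dim) / len(vectors) for dim in zip(*vectors)]

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def representative_docs(doc_vectors, top_k):
    """Return the ids of the top_k documents closest to the category center."""
    center = mean_vector(list(doc_vectors.values()))
    ranked = sorted(doc_vectors, key=lambda d: euclidean(doc_vectors[d], center))
    return ranked[:top_k]

# Hypothetical per-document vectors (e.g. keyword-vector averages) for one category.
category_docs = {
    "d1": [0.0, 0.0],
    "d2": [1.0, 1.0],
    "d3": [2.0, 2.0],
    "d4": [10.0, 10.0],
}
selected = representative_docs(category_docs, top_k=2)  # center is (3.25, 3.25)
```

In the experiment `top_k` would be 50 per category. The outlier `d4` pulls the center toward itself but is still the farthest document from it, so it is not selected.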

Following the procedure described above, 1,050 compound documents (21 combinations × 50) were generated by the random selection method and 1,050 compound documents by the center-based method. The 350 original documents used to compose the compound documents (7 categories × 50) were excluded from the set of single documents, leaving 3,147 − 350 = 2,797 single documents. As a result, word vectors were trained on a total of 3,847 documents (2,797 single documents + 1,050 compound documents). word2Vec modeling was used for the word embedding, with the vector dimension set to 300, the number of training iterations set to 50, and the window size set to 5. A total of 36,798 word vectors were derived.

4.3. Document vector generation

This section introduces the five document embedding methods used to compare the performance of the present invention: (1) Gen_doc2Vec, (2) doc2Vec_Key, (3) Proposed SV, (4) Multi-Vector(Ideal), and (5) Multi-Vector(Clustering) (FIG. 13).

In FIG. 13, (1) Gen_doc2Vec denotes the result of document embedding by ordinary doc2Vec modeling, and (2) doc2Vec_Key denotes the result of ordinary doc2Vec modeling in which the keyword list is used as the vocabulary. (3) Proposed SV is the result of deriving the document vector as the average of the word vectors of the keywords specified in the document. Methods (1) to (3) all have in common that each document is expressed as a single vector. Methods (4) and (5), in contrast, express each document as a multi-vector. Of these, (4) Multi-Vector(Ideal) makes use of the original category information of the two original documents that make up a compound document.

For example, for a compound document in the 'Natural Science + Engineering' field, the two original documents belong to the 'Natural Science' and 'Engineering' categories respectively. The keyword set of the compound document is divided into the subset of keywords specified in the 'Natural Science' document and the subset of keywords specified in the 'Engineering' document. The average of the word vectors in each subset is then computed, and these two average vectors are used as the member vectors of the compound document.

This approach, however, is feasible only as an experimental baseline for performance comparison, because the keyword sets of real-world compound documents do not come with prior information that would allow them to be divided in this way. In the present invention, the keyword set is therefore divided into subsets by clustering the keywords, and (5) Multi-Vector(Clustering) is the result of deriving the member vectors of a document in this manner. In other words, the core methodology proposed by the present invention is (5), while (4) is an idealized model that cannot be applied in practice but is included to allow a relative comparison of the performance of the present invention.

4.4. Performance evaluation

4.4.1. Performance evaluation measure

This subsection describes how the performance of the five document embedding methodologies introduced above is evaluated. Specifically, each of the five document embeddings is carried out in turn, and under each method the document with the highest similarity to each compound document is identified. If the identified document belongs to the same category as the reference document, the embedding is judged to be accurate. If a document belonging to a different category from the reference document was judged to be similar, the embedding is judged to be inaccurate.

Because each compound document was composed by merging documents from two categories, accuracy is evaluated in this experiment by recommending two similar documents for each document. For the multi-vector methods (4) and (5), the nearest document is recommended for each of the two member vectors that make up the document.

For the single-vector methods (1) to (3), the two documents nearest to the vector of each document are therefore recommended, to keep the comparison fair with the multi-vector methodology, which recommends two similar documents. Whether the categories of the similar documents match is classified into three cases: both categories are matched (Totally Correct), only one of the two is matched (Partially Correct), and neither is matched (Completely Incorrect).
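As a non-limiting illustration, the two recommendation rules might be sketched as follows. The disclosure does not name the similarity measure or state how candidate documents are represented, so cosine similarity, one vector per candidate, the function names, and the two-dimensional toy vectors are all assumptions for this sketch.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equally sized, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def recommend_single(query_vec, candidates, k=2):
    """Single-vector rule: the k candidates most similar to the one document vector."""
    ranked = sorted(candidates, key=lambda d: cosine(query_vec, candidates[d]), reverse=True)
    return ranked[:k]

def recommend_multi(member_vecs, candidates):
    """Multi-vector rule: the most similar candidate for each member vector."""
    return [max(candidates, key=lambda d: cosine(m, candidates[d])) for m in member_vecs]

# Hypothetical candidate documents, one vector each.
candidates = {"eng_doc": [1.0, 0.0], "hum_doc": [0.0, 1.0], "med_doc": [0.7, 0.7]}

members = [[0.9, 0.1], [0.1, 0.9]]   # member vectors of one compound document
single = [0.5, 0.5]                  # average of the two member vectors

multi_result = recommend_multi(members, candidates)
single_result = recommend_single(single, candidates, k=2)
```

With these toy values the member vectors retrieve `eng_doc` and `hum_doc`, whereas the averaged single vector ranks the unrelated `med_doc` first, which mirrors the distortion the multi-vector representation is meant to avoid.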

Here, when the categories of the original documents are 'Engineering' and 'Humanities', a result is judged Partially Correct both when just one of the two recommended documents belongs to 'Engineering' or 'Humanities' and when both recommended documents belong to 'Engineering' or both belong to 'Humanities'. If neither of the two recommended documents belongs to 'Engineering' or 'Humanities', the result is judged Completely Incorrect. Under this measure, the more Totally Correct results and the fewer Completely Incorrect results a method produces, the more accurate its embedding can be judged to be.
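As a non-limiting illustration, the three-way judgement might be sketched as follows. The function name `judge` is an assumption, and the sketch assumes that each recommended document carries a single category label, which the disclosure does not state explicitly.

```python
def judge(original_categories, recommended_categories):
    """Classify a pair of recommendations against the two original categories."""
    originals = set(original_categories)
    matched = originals & set(recommended_categories)
    if matched == originals:
        return "Totally Correct"       # both original categories were recovered
    if matched:
        return "Partially Correct"     # only one original category was recovered
    return "Completely Incorrect"      # neither original category was recovered

originals = ("Engineering", "Humanities")
results = [
    judge(originals, ("Humanities", "Engineering")),
    judge(originals, ("Engineering", "Engineering")),
    judge(originals, ("Engineering", "Medicine and Pharmacy")),
    judge(originals, ("Medicine and Pharmacy", "Arts and Physical Education")),
]
```

Two recommendations from the same original category count as Partially Correct because only one of the two original categories is covered.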

4.4.2. Performance analysis results

This subsection presents the results of comparing the accuracy of the five document embedding methodologies according to the performance evaluation measure introduced above (FIG. 14).

Panel (a) of FIG. 14 shows the results of the experiment in which 50 representative documents per category were selected based on the center of each category, and panel (b) of FIG. 14 shows the results of the experiment in which 50 documents per category were selected at random. FIG. 15 presents the results of FIG. 14 as the ratio of the number of documents in each judgement class to the total number of documents, and FIG. 16 plots these ratios as a graph.

FIG. 16 confirms that, under both the center-based and the random selection schemes, the Multi-Vector(Ideal) method performed best, that is, it showed the highest Totally Correct ratio and the lowest Completely Incorrect ratio. In addition, Multi-Vector(Clustering), the method of the present invention, achieved accuracy close to that of Multi-Vector(Ideal) even though it used no prior information about the categories of the original documents. Overall, the multi-vector methods outperformed the single-vector methods.

In particular, the Totally Correct ratio, that is, the share of documents for which both categories were matched, was 8% to 10.67% for the three single-vector methods including the traditional doc2Vec, whereas it was considerably higher, 23.71% to 24.86%, for the Multi-Vector(Clustering) method proposed in the present invention. An even more notable part of these results is the Completely Incorrect ratio shown at the top of each bar in the graph. It represents the share of cases in which neither of the two recommended documents belongs to either of the two original categories.

For example, this is the case where two similar documents are recommended for a compound document that merges one 'Humanities' document and one 'Engineering' document, and both recommended documents belong to categories such as 'Medicine' or 'Arts and Sports'. For the three single-vector methods this ratio was 24.19% to 28.10%, whereas for Multi-Vector(Clustering) it was lower, 16.38% to 16.67%. The superior performance of the present invention is attributed to the fact that it semantically decomposes the features that make up a compound document and represents them separately.

Existing techniques for text analysis have mainly focused on the stages that follow text structuring, that is, analysis and application stages such as classification, clustering, and topic modeling. Recently, however, the finding that text structuring substantially determines the quality of analysis results has drawn attention to the importance of this step, and research on document embedding is being actively conducted. Accordingly, the present invention points out the limitations of existing document structuring methods, represented by doc2Vec, and proposes a way to overcome them.

Although the present invention has been described above with reference to preferred embodiments, those skilled in the art will understand that it can be modified and changed in various ways without departing from the spirit and scope of the invention as set forth in the claims below.

100: Multi-vector document embedding system through semantic decomposition of compound documents
110: user terminal 130: multi-vector document embedding device
150: database
210: target document parsing unit 220: word embedding unit
230: keyword vector extraction unit 240: keyword clustering unit
250: multiple vector generation unit 260: control unit

Claims

A multi-vector document embedding apparatus through semantic decomposition of a compound document, the apparatus comprising:
a target document parsing unit that separates all terms into tokens by parsing a target document included in a document set;
a word embedding unit that converts each token into a word vector through word embedding;
a keyword vector extraction unit that generates a keyword vector set for each document by extracting, from among the word vectors, the word vectors of the tokens designated as keywords for each target document;
a keyword clustering unit that generates a plurality of keyword clusters by performing clustering analysis on the keyword vector set for each document; and
a multi-vector generation unit that generates a vector for each keyword cluster based on the keyword vectors included in each of the plurality of keyword clusters and determines the generated vectors as the multiple vectors of the target document associated with the plurality of keyword clusters.

The apparatus of claim 1, wherein the target document parsing unit performs the parsing after adding a keyword set of the target document to a word dictionary used for the parsing.

The apparatus of claim 2, wherein, when no keyword set exists, the target document parsing unit determines key terms extracted through analysis of the target document as the keywords.

The apparatus of claim 1, wherein the word embedding unit performs the word embedding using Word2Vec and generates, as the word vector, a vector having n-dimensional real values for each token.

The apparatus of claim 1, wherein the keyword vector extraction unit generates, for a specific target document, a keyword vector set in which each element is a pair of a keyword of the keyword set and the word vector for that keyword.

The apparatus of claim 4, wherein the keyword clustering unit performs the clustering analysis by applying hierarchical cluster analysis or non-hierarchical cluster analysis based on the n-dimensional real values of the keyword vectors included in the keyword vector set.

The apparatus of claim 1, wherein the multi-vector generation unit generates the vector for each keyword cluster by averaging the keyword vectors included in that keyword cluster.

The apparatus of claim 7, wherein the multi-vector generation unit generates a member vector set that contains the vector for each keyword cluster as a member vector and determines the member vector set as the multiple vectors.

A multi-vector document embedding method through semantic decomposition of a compound document, performed in a multi-vector document embedding apparatus, the method comprising:
separating all terms into tokens by parsing a target document included in a document set;
converting each token into a word vector through word embedding;
generating a keyword vector set for each document by extracting, from among the word vectors, the word vectors of the tokens designated as keywords for each target document;
generating a plurality of keyword clusters by performing clustering analysis on the keyword vector set for each document; and
generating a vector for each keyword cluster based on the keyword vectors included in each of the plurality of keyword clusters and determining the generated vectors as the multiple vectors of the target document associated with the plurality of keyword clusters.
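For illustration only, the steps recited in the method claim can be sketched in a few lines of Python. This is not the claimed implementation. Several simplifications are assumptions: whitespace splitting stands in for the parser, a hand-written `word_vectors` table stands in for Word2Vec output, the keyword set is supplied directly, and a plain k-means with a fixed number of clusters stands in for the clustering analysis (the claims allow either hierarchical or non-hierarchical methods). All names are hypothetical.

```python
import numpy as np

def kmeans(points, k, iters=20):
    """Plain k-means (a non-hierarchical method); returns one label per point."""
    # Deterministic farthest-point initialization of the k centers.
    centers = [points[0]]
    for _ in range(1, k):
        dists = np.min([np.linalg.norm(points - c, axis=1) for c in centers], axis=0)
        centers.append(points[int(np.argmax(dists))])
    centers = np.array(centers, dtype=float)
    for _ in range(iters):
        # Assign each point to its nearest center, then recompute the centers.
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = np.argmin(dists, axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = points[labels == j].mean(axis=0)
    return labels

def embed_multi_vector(text, keywords, word_vectors, k=2):
    """Return one vector per keyword cluster for a single document."""
    tokens = text.lower().split()                                    # parsing
    token_vecs = {t: word_vectors[t] for t in tokens if t in word_vectors}  # word embedding lookup
    kw = [t for t in dict.fromkeys(tokens) if t in keywords and t in token_vecs]
    kw_vecs = np.array([token_vecs[t] for t in kw], dtype=float)     # keyword vector set
    labels = kmeans(kw_vecs, k)                                      # keyword clustering
    # Each cluster vector is the mean of the keyword vectors in that cluster.
    return [kw_vecs[labels == j].mean(axis=0) for j in range(k) if np.any(labels == j)]

# Hypothetical 2-dimensional word vectors for a document mixing two topics.
word_vectors = {
    "engine": [1.0, 0.1], "circuit": [0.9, 0.0],
    "poetry": [0.0, 1.0], "novel": [0.1, 0.9], "the": [0.5, 0.5],
}
doc_vectors = embed_multi_vector(
    "The engine circuit and the poetry novel",
    keywords={"engine", "circuit", "poetry", "novel"},
    word_vectors=word_vectors,
)
```

In this toy input the two resulting vectors stay close to the two topic directions, whereas a single average over all four keywords would fall between them.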