KR20240047736A

KR20240047736A - Method for domain-specific keyword extraction and system thereof

Info

Publication number: KR20240047736A
Application number: KR1020220127136A
Authority: KR
Inventors: 김성진; 최윤정; 변정은; 배국진; 김지민; 송경우; 최호식; 이은경; 황혜지; 신경준; 오창대; 김태로; 김예원
Original assignee: 한국과학기술정보연구원
Priority date: 2022-10-05
Filing date: 2022-10-05
Publication date: 2024-04-12

Abstract

The present disclosure relates to a method for extracting domain core keywords and a system to which the method is applied. The domain core keyword extraction method according to the present disclosure may include: obtaining first domain reference keywords extracted from a document set of a first domain; obtaining a first embedding vector for a target document using a pre-trained embedding model; obtaining, using the embedding model, second embedding vectors for some of the keywords extracted from the target document; obtaining, using the embedding model, third embedding vectors for each of the first domain reference keywords; and outputting, as core keywords of the target document for the first domain, the keywords of those embedding vectors, among the second embedding vectors and the third embedding vectors, whose similarity to the first embedding vector is equal to or greater than a reference value.

Description

Method for extracting domain core keywords and system thereof {METHOD FOR DOMAIN-SPECIFIC KEYWORD EXTRACTION AND SYSTEM THEREOF}

The present disclosure relates to a method for extracting domain core keywords and a system to which the method is applied. More specifically, it relates to a method for extracting the important keywords of a target document together with important keywords of a specific field that are not included in the target document, and to a system to which the method is applied.

Techniques for extracting the core keywords contained in a specific document set, using artificial intelligence or text mining, have conventionally been available. However, these conventional techniques merely offer ways to extract, as core keywords, words that appear frequently in the collected documents.

Even when a collected document does not belong to a specific field, it may still contain many keywords related to terms commonly used in that field. For example, a document in a technical field may contain many keywords relevant to the technology commercialization field, yet a user cannot extract the core keywords (terms) of the technology commercialization field from the technical document using the conventional techniques alone.

In addition, conventional techniques extract core keywords from a specific document set using text mining methods such as TF-IDF (Term Frequency-Inverse Document Frequency). Because these methods extract keywords simply on the basis of how often they occur in the documents, they frequently extract keywords that are unrelated to the field to which the documents belong, so post-correction by the user is required.

Accordingly, there is a need for a method that automatically extracts the core keywords of a specific document set as well as core keywords of the same field that do not appear in those documents, and for a system to which the method is applied. Existing keyword extraction methods do not provide such a function.

A technical problem to be solved through some embodiments of the present disclosure is to provide a method for extracting the core keywords of a collected document as well as core keywords that relate to the field of the document but are not included in it, and a system to which the method is applied.

Another technical problem to be solved through some embodiments of the present disclosure is to provide a method of performing keyword extraction efficiently within a predetermined keyword extraction limit.

Yet another technical problem to be solved through some embodiments of the present disclosure is to provide a method of training an artificial intelligence language model so that it is specialized for a specific field.

The technical problems of the present invention are not limited to those mentioned above, and other technical problems not mentioned will be clearly understood by those skilled in the art from the description below.

A domain core keyword extraction method according to an embodiment of the present disclosure for solving the above technical problems may include: obtaining first domain reference keywords extracted from a document set of a first domain; obtaining a first embedding vector for a target document using a pre-trained embedding model; obtaining, using the embedding model, second embedding vectors for some of the keywords extracted from the target document; obtaining, using the embedding model, third embedding vectors for each of the first domain reference keywords; and outputting, as core keywords of the target document for the first domain, the keywords of those embedding vectors, among the second embedding vectors and the third embedding vectors, whose similarity to the first embedding vector is equal to or greater than a reference value.
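The steps of this method can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: `toy_embed` is a hypothetical stand-in for the pre-trained embedding model (it hashes character trigrams instead of running a neural network), cosine similarity is assumed as the similarity measure, and all function names, parameters, and the sample text are invented for the example.

```python
import math
import zlib
from typing import Callable, List, Sequence, Tuple

DIM = 64  # size of the toy embedding space (an assumption for this sketch)


def toy_embed(text: str) -> List[float]:
    """Hypothetical stand-in for the pre-trained embedding model.

    Hashes character trigrams into a fixed-size count vector and
    L2-normalises it. A real system would call a trained model here.
    """
    vec = [0.0] * DIM
    padded = "  " + text.lower() + "  "
    for i in range(len(padded) - 2):
        bucket = zlib.crc32(padded[i:i + 3].encode("utf-8")) % DIM
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; the inputs are already unit length."""
    return sum(x * y for x, y in zip(a, b))


def extract_domain_core_keywords(
    target_doc: str,
    doc_keywords: Sequence[str],
    reference_keywords: Sequence[str],
    threshold: float,
    embed: Callable[[str], List[float]] = toy_embed,
) -> List[Tuple[str, float]]:
    """Return (keyword, similarity) pairs whose similarity to the
    document vector is at least `threshold`, highest first."""
    doc_vec = embed(target_doc)  # first embedding vector
    # Second vectors (document keywords) and third vectors (domain
    # reference keywords) are scored in one pass; duplicates are dropped.
    candidates = list(dict.fromkeys([*doc_keywords, *reference_keywords]))
    scored = [(kw, cosine(embed(kw), doc_vec)) for kw in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [(kw, score) for kw, score in scored if score >= threshold]
```

Because the reference keywords are scored against the target document in the same way as the document's own keywords, a reference keyword that never occurs in the target document can still be output when its vector is close enough to the document vector.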

In some embodiments, the target document may not be included in the document set of the first domain.

In some embodiments, outputting the core keywords for the first domain may include outputting the keywords of some of the third embedding vectors as core keywords of the target document for the first domain.

In some embodiments, the domain core keyword extraction method may further include excluding, from the core keywords for the first domain, a keyword whose similarity to the first embedding vector is equal to or less than a reference value and whose similarity to the other core keywords for the first domain is equal to or greater than a reference value.

In some embodiments, the keywords extracted from the target document may include single-word keywords, two-word keywords, and three-word keywords.
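One way to produce single-word, two-word, and three-word candidates is to enumerate word n-grams. The sketch below assumes whitespace tokenization, which the disclosure does not specify, and the function name `candidate_ngrams` is hypothetical.

```python
from typing import List


def candidate_ngrams(text: str, max_n: int = 3) -> List[str]:
    """List the distinct 1- to max_n-word phrases of `text`, shortest first."""
    tokens = text.split()  # whitespace tokenization is an assumption
    seen = {}  # dict preserves insertion order and removes duplicates
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            seen.setdefault(" ".join(tokens[i:i + n]), None)
    return list(seen)
```

For Korean text, a morphological analyzer would likely be a better tokenizer than whitespace splitting; that choice is left open here.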

A domain core keyword extraction system according to another embodiment of the present disclosure for solving the above technical problems may include a network interface that receives a plurality of documents, a memory into which a domain core keyword extraction program is loaded, and one or more processors that execute the domain core keyword extraction program. The domain core keyword extraction program may include: an instruction for obtaining first domain reference keywords extracted from a document set of a first domain; an instruction for obtaining a first embedding vector for a target document using a pre-trained embedding model; an instruction for obtaining, using the embedding model, second embedding vectors for some of the keywords extracted from the target document; an instruction for obtaining, using the embedding model, third embedding vectors for each of the first domain reference keywords; and an instruction for outputting, as core keywords of the target document for the first domain, the keywords of those embedding vectors, among the second embedding vectors and the third embedding vectors, whose similarity to the first embedding vector is equal to or greater than a reference value.

In some embodiments, the instruction for outputting the core keywords for the first domain may include an instruction for outputting the keywords of some of the third embedding vectors as core keywords of the target document for the first domain.

In some embodiments, the domain core keyword extraction program may further include an instruction for excluding, from the core keywords for the first domain, a keyword whose similarity to the first embedding vector is equal to or less than a reference value and whose similarity to the other core keywords for the first domain is equal to or greater than a reference value.

FIG. 1 is an exemplary block diagram illustrating a domain core keyword extraction system according to an embodiment of the present disclosure.
FIG. 2 is a diagram illustrating the differences between an embedding model according to some embodiments of the present disclosure and a conventional embedding model.
FIG. 3 is a diagram for explaining an embedding model used in some embodiments of the present disclosure.
FIG. 4 is a diagram illustrating how the training loss changes as an embedding model according to some embodiments of the present disclosure is trained.
FIG. 5 is a flowchart of a domain core keyword extraction method according to another embodiment of the present disclosure.
FIG. 6 is a diagram illustrating the steps of obtaining a first embedding vector and second embedding vectors that may be performed in some embodiments of the present disclosure.
FIG. 7 is a diagram illustrating a method of obtaining third embedding vectors that may be performed in some embodiments of the present disclosure.
FIG. 8 is a hardware configuration diagram of a domain core keyword extraction system according to yet another embodiment of the present disclosure.

Hereinafter, preferred embodiments of the present disclosure will be described in detail with reference to the accompanying drawings. The advantages and features of the present invention, and the methods of achieving them, will become clear from the embodiments described in detail below together with the accompanying drawings. However, the technical idea of the present invention is not limited to the following embodiments and may be implemented in various different forms. The following embodiments are provided only to make the technical idea of the present invention complete and to fully inform those of ordinary skill in the art to which the present invention pertains of the scope of the invention, and the technical idea of the present invention is defined only by the scope of the claims.

Before describing the various embodiments of the present disclosure, the terms used in the following embodiments are clarified.

In the following embodiments, a 'domain' may mean the field that is the target of keyword extraction. For example, 'keyword extraction for the technology commercialization domain' may mean extracting terms of the technology commercialization field.

In the following embodiments, an 'embedding model' may mean an artificial intelligence model that converts natural language used by humans into a vector, a numerical form that a machine can process. In the art, the term embedding model may be used interchangeably with terms such as 'language model'.

In the following embodiments, an 'embedding vector' may mean the vector obtained when the embedding model converts a specific keyword into a vector.

In describing the present disclosure, detailed descriptions of related known configurations or functions are omitted where they are judged likely to obscure the gist of the present invention.

Hereinafter, some embodiments of the present disclosure will be described with reference to the drawings.

FIG. 1 is a block diagram illustrating the structure of a domain core keyword extraction system 1000 according to an embodiment of the present disclosure.

Referring to FIG. 1, the domain core keyword extraction system 1000 according to some embodiments of the present disclosure may include a document preprocessing unit 110. In some embodiments, the document preprocessing unit 110 may preprocess documents input to the domain core keyword extraction system 1000 according to specific rules.

In some embodiments of the present disclosure, the document preprocessing unit 110 may perform normalization on the text data included in a document input to the domain core keyword extraction system 1000.

Here, normalization may mean merging words that are written differently but have substantially the same meaning into a single expression. For example, 'USA' and 'US' are different strings, but both denote the United States, so the document preprocessing unit 110 may unify them as 'USA'. More preferably, the document preprocessing unit 110 may merge substantially identical words into whichever of their expressions requires the least storage space, just as unifying to 'US' requires less storage space than unifying to 'USA'.
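The shortest-variant normalization described here can be sketched as follows. The sketch assumes that groups of equivalent expressions are supplied by hand and that length in characters is an acceptable proxy for storage space; `build_canonical_map` and `normalize` are hypothetical names, not part of the disclosure.

```python
import re
from typing import Dict, Iterable, List


def build_canonical_map(variant_groups: Iterable[List[str]]) -> Dict[str, str]:
    """Map every variant to the shortest variant of its group
    (ties go to the variant listed first)."""
    mapping = {}
    for group in variant_groups:
        canonical = min(group, key=len)
        for variant in group:
            mapping[variant] = canonical
    return mapping


def normalize(text: str, mapping: Dict[str, str]) -> str:
    """Replace each whole-word variant in `text` with its canonical form."""
    if not mapping:
        return text
    # Longest variants first, so 'United States' is tried before 'US'.
    alternatives = sorted(mapping, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(v) for v in alternatives) + r")\b")
    return pattern.sub(lambda m: mapping[m.group(1)], text)
```

The word-boundary anchors keep the replacement from firing inside longer words such as 'USAGE'.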

In some other embodiments of the present disclosure, the document preprocessing unit 110 may unify the uppercase and lowercase forms of English text data included in a document input to the domain core keyword extraction system 1000.

In some further embodiments of the present disclosure, the document preprocessing unit 110 may remove, from a document input to the domain core keyword extraction system 1000, words whose frequency of occurrence is below a reference value. According to this embodiment, the domain core keyword extraction system 1000 deletes data that does not affect the analysis of the sentences in the input document but occupies storage space unnecessarily, and can thereby save the resources consumed in extracting specific keywords from the document.

In some further embodiments of the present disclosure, the document preprocessing unit 110 may delete stopwords included in the text data of a document input to the domain core keyword extraction system 1000. Here, a stopword may mean a word that appears in the text data but carries no substantive meaning, such as an interjection.
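The case unification, rare-word removal, and stopword removal described for the document preprocessing unit 110 might be combined as in the sketch below. The stopword list, the `min_count` reference value, and the function name `preprocess` are illustrative assumptions; the disclosure does not fix any of them.

```python
from collections import Counter
from typing import FrozenSet, List, Sequence

# Illustrative stopword list (interjections); a real list would be larger.
STOPWORDS: FrozenSet[str] = frozenset({"oh", "wow", "um"})


def preprocess(
    tokens: Sequence[str],
    min_count: int = 2,
    stopwords: FrozenSet[str] = STOPWORDS,
) -> List[str]:
    """Lowercase the tokens, then drop stopwords and tokens that occur
    fewer than `min_count` times in the document."""
    lowered = [t.lower() for t in tokens]
    counts = Counter(lowered)  # frequencies are counted after case folding
    return [t for t in lowered if counts[t] >= min_count and t not in stopwords]
```

Counting after case folding means 'Patent' and 'PATENT' contribute to the same frequency, which matters when the reference value is small.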

The domain core keyword extraction system 1000 according to some embodiments of the present disclosure may include an embedding model training unit 130. In some embodiments, the embedding model training unit 130 may train an embedding model that outputs an embedding vector in response to input text data. This is described below with reference to FIG. 2.

In some embodiments of the present disclosure, the embedding model training unit 130 may input data about a specific domain into a pre-trained Korean-language embedding model 21 and train it so that its analysis performance on text related to that domain improves, thereby building an additionally trained embedding model 65. Here, the specific domain may be a predefined target domain for which the domain core keyword extraction system 1000 performs core keyword extraction, or it may be identified automatically as the domain related to the target document that the system receives as input for core keyword extraction.

That the conventional embedding model 21 is a Korean-language embedding model trained to analyze Korean is merely an example to aid understanding of the present disclosure, and the language analyzed by the embedding model 21 is not limited. In some other embodiments, the embedding model 21 may not be pre-trained and may instead be a model built through training by the embedding model training unit 130.

In some other embodiments of the present disclosure, the embedding model 21 may be a natural language processing model based on an artificial neural network. That is, the embedding model 21 may be understood to be any one of a T5 model, a BERT (Bidirectional Encoder Representations from Transformers) model, and a GPT model, but is not limited thereto.

More preferably, the embedding model 21 may be a BERT model, in order to optimize the performance with which the domain core keyword extraction system 1000 extracts specific core keywords from documents of a specific domain. The BERT model can analyze the meaning of each keyword in a sentence by referring to information about the other keywords in the same sentence. Referring to FIG. 3, when [CLS] (32) and the words I, Love, You (33) are input together into the embedding model 65, the [CLS] (32) input is a single token, but the information in the output [CLS] (31) may include information about the words I, Love, You (33).

In some further embodiments of the present disclosure, referring to FIG. 4, the embedding model training unit 130 may build the additionally trained embedding model 65 by inputting data about a specific domain into the embedding model 21, and it will be understood that the training loss is inversely proportional to the number of times the additionally trained embedding model 65 has learned the entire training data.

The domain core keyword extraction system 1000 according to some embodiments of the present disclosure may include a core keyword extraction unit 140. In some embodiments, the core keyword extraction unit 140 may extract core keywords from a document input to the domain core keyword extraction system 1000 on the basis of predefined rules.

In some embodiments of the present disclosure, the core keyword extraction unit 140 may measure how often each keyword occurs in the text of a document input to the domain core keyword extraction system 1000. Keywords whose frequency is equal to or greater than a reference value may be extracted as reference keywords for the document.

In some other embodiments of the present disclosure, the core keyword extraction unit 140 may extract core keywords from a document input to the domain core keyword extraction system 1000 on the basis of the computation results of a similarity comparison unit 150 included in the system; this is described later.

The domain core keyword extraction system 1000 according to some embodiments of the present disclosure may include a keyword refinement unit 120. The keyword refinement unit 120 may delete, from the keywords extracted by the core keyword extraction unit 140, those core keywords judged unnecessary on the basis of predefined rules.

In some embodiments of the present disclosure, the keyword refinement unit 120 may receive, from the similarity comparison unit 150, the result of measuring the similarity between each vectorized core keyword and the vectorized document input to the domain core keyword extraction system 1000. A core keyword whose similarity to the document is equal to or less than a reference value may be judged unnecessary and deleted. Here, the core keyword need not be one that is included in the document.

In some other embodiments of the present disclosure, the keyword refinement unit 120 may measure the similarity between the vectorized core keywords and may judge unnecessary, and delete, the keyword of any core keyword vector whose similarity to the other core keyword vectors is equal to or greater than a reference value. Those skilled in the art will understand that the similarity between vectorized core keywords may mean the distance between them when the vector of each core keyword is expressed as n-dimensional coordinates.

In some further embodiments of the present disclosure, the keyword refinement unit 120 may assign to each core keyword extracted by the core keyword extraction unit 140 a first score, which is the similarity between the vectorized core keyword and the document, and a second score, which is its similarity to the other core keyword vectors. If there is a core keyword for which the sum of the first score and the second score is equal to or greater than a reference value, that core keyword may be judged unnecessary and deleted.
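Two of the refinement rules described for the keyword refinement unit 120 (dropping keywords that are too dissimilar to the document, and dropping keywords that are too similar to other keywords) can be sketched as below. Several choices here are assumptions rather than part of the disclosure: cosine similarity as the measure, the two threshold parameters, the function names, and in particular the greedy order, which keeps the member of a near-duplicate pair that is closer to the document instead of deleting both. The combined-score rule is not implemented.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def refine_keywords(
    doc_vec: Sequence[float],
    keyword_vecs: Dict[str, Sequence[float]],
    min_doc_sim: float,
    max_pair_sim: float,
) -> List[str]:
    """Return the keywords that survive both refinement rules."""
    # Rule 1: drop keywords whose similarity to the document is at or
    # below the reference value.
    doc_sim = {kw: cosine(vec, doc_vec) for kw, vec in keyword_vecs.items()}
    relevant = [kw for kw, sim in doc_sim.items() if sim > min_doc_sim]
    # Rule 2 (greedy variant): visit keywords from most to least similar
    # to the document and drop any that is too close to one already kept.
    kept: List[str] = []
    for kw in sorted(relevant, key=lambda k: doc_sim[k], reverse=True):
        if all(cosine(keyword_vecs[kw], keyword_vecs[k]) < max_pair_sim for k in kept):
            kept.append(kw)
    return kept
```

With hand-made two-dimensional vectors, a near-duplicate such as 'patents' next to 'patent' is removed while an unrelated but relevant keyword is retained.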

The domain core keyword extraction system 1000 according to some embodiments of the present disclosure may include a similarity comparison unit 150. In some embodiments, the similarity comparison unit 150 may measure, among other things, the similarity between a plurality of documents input to the domain core keyword extraction system 1000 and the similarity between the core keywords that the core keyword extraction unit 140 extracts from a document input to the system.

In some embodiments of the present disclosure, the similarity comparison unit 150 may compare each vectorized core keyword with the vectorized document input to the domain core keyword extraction system 1000 to measure the similarity between the core keyword and the document. The similarity comparison unit 150 may then transmit the measured similarity to the keyword refinement unit 120.

In some embodiments of the present disclosure, the similarity comparison unit 150 may measure the similarity between a core keyword vector, obtained by vectorizing a core keyword extracted from a document input to the domain core keyword extraction system 1000, and a document vector obtained by vectorizing that document. The similarity comparison unit 150 may then transmit the measured similarity to the core keyword extraction unit 140.

The configuration and operation of the domain core keyword extraction system 1000, and an exemplary environment to which it may be applied, have been described above with reference to FIGS. 1 to 4, along with various embodiments related to its operation. It should be noted that these embodiments are illustrative only and are not limiting.

Hereinafter, a domain core keyword extraction method according to another embodiment of the present disclosure will be described in detail with reference to FIGS. 5 to 7.

FIG. 5 is a flowchart of a domain core keyword extraction method according to another embodiment of the present disclosure. Unless otherwise stated, the steps described in the flowchart below may be understood to be performed by the domain core keyword extraction system.

In step S100 shown in FIG. 5, the domain core keyword extraction system may obtain first domain reference keywords. Here, the first domain is an illustrative name for the domain targeted by the core keyword extraction operation that the system performs, and the term is not meant to be limiting.

In some embodiments of the present disclosure, the first domain may be determined according to the content of the document input to the domain core keyword extraction system.

In some embodiments related to step S100, the domain core keyword extraction system may calculate, for the text contained in an input set of documents related to the first domain, the frequency with which each word in the text appears in those documents, and may identify words whose frequency is greater than or equal to a reference value as first domain reference keywords.
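The frequency-based identification described for step S100 can be illustrated with a minimal sketch. It assumes whitespace tokenization and a total count across the document set; the function name `reference_keywords`, the parameter `min_count`, and the sample documents are hypothetical and do not appear in the disclosure, which leaves the rule for choosing reference keywords open.

```python
from collections import Counter


def reference_keywords(documents, min_count):
    """Return the words whose total count across the document set is >= min_count."""
    counts = Counter()
    for text in documents:
        # Assumed tokenization: lowercase, then split on whitespace.
        counts.update(text.lower().split())
    # Sort so that the result is deterministic.
    return sorted(word for word, n in counts.items() if n >= min_count)


# Hypothetical first-domain document set.
docs = [
    "graphene biosensor design",
    "graphene sheet growth",
    "biosensor response to graphene",
]
keywords = reference_keywords(docs, min_count=2)
print(keywords)  # ['biosensor', 'graphene']
```

The reference value here is an absolute count; a relative frequency or a per-document count would be an equally plausible reading of "frequency".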

In some other embodiments related to step S100, the domain core keyword extraction system may calculate, for the text contained in an input target document, the frequency with which each word in the target document appears in that document, and may identify words whose frequency is greater than or equal to a reference value as target document reference keywords. Here, the target document may belong to the same domain as the first domain, but may be a document that is not included in the first domain document set received by the system.

In addition, identifying the target document reference keywords may include identifying single-word reference keywords, identifying double-word reference keywords, and identifying triple-word reference keywords. Here, a single-word reference keyword is a reference keyword consisting of one word, a double-word reference keyword consists of two words, and a triple-word reference keyword consists of three words.

The single-word, double-word, and triple-word reference keywords correspond to what are known in the art as unigram reference keywords, bigram reference keywords, and trigram reference keywords, respectively.

For example, suppose the target document received by the domain core keyword extraction system contains the sentence '계산 가능한 수에 대해, 수리 명제 자동 생성 문제의 응용' ("On computable numbers, an application to the problem of automatically generating mathematical propositions"). When single-word reference keywords are identified, the sentence is split into '계산', '가능한', '수에', '대해', '수리', '명제', '자동', '생성', '문제의', '응용', and so on. When double-word reference keywords are identified, it is split into '계산 가능한', '가능한 수에', '수에 대해', '대해 수리', '수리 명제', '명제 자동', '자동 생성', '생성 문제의', '문제의 응용', and so on.
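The splitting in the example above can be reproduced with a short sketch. It assumes that punctuation is dropped and that words are separated by whitespace, which matches the listed tokens but is not stated in the disclosure; the helper name `ngrams` is hypothetical.

```python
import re


def ngrams(sentence, n):
    """Split a sentence into overlapping n-word candidates."""
    # Assumed tokenization: replace punctuation with spaces, split on whitespace.
    tokens = re.sub(r"[^\w\s]", " ", sentence).split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


sentence = "계산 가능한 수에 대해, 수리 명제 자동 생성 문제의 응용"
unigrams = ngrams(sentence, 1)  # single-word candidates
bigrams = ngrams(sentence, 2)   # double-word candidates
trigrams = ngrams(sentence, 3)  # triple-word candidates

print(len(unigrams), len(bigrams), len(trigrams))  # 10 9 8
print(bigrams[:3])  # ['계산 가능한', '가능한 수에', '수에 대해']
```

Keeping the three lists separate mirrors the separate unigram, bigram, and trigram groups that the description goes on to mention.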

In addition, single-word, double-word, and triple-word reference keywords may be stored as separate reference keywords. More preferably, the domain core keyword extraction system may form a separate group for each of the single-word, double-word, and triple-word reference keywords.

Obtaining reference keywords for an input document based on how often each word in the text appears in the document, as described above, is only an example intended to aid understanding of the present disclosure; the rule by which reference keywords are identified is not limited to it.

Next, in step S200, the domain core keyword extraction system may obtain a first embedding vector for the input target document using an embedding model that has been additionally trained by the embedding model training unit of the system. The additionally trained embedding model may be an embedding model that has been further trained using documents related to the first domain as training data. This is described below with reference to FIG. 6.

The first embedding vector 66 may be the vector form of the target document 61 that the additionally trained embedding model 65 outputs when it receives the text data of the target document 61. How the additionally trained embedding model 65 converts the text data of the target document 61 into a vector may differ depending on the type of embedding model. For example, a BERT embedding model according to a preferred embodiment of the present disclosure may convert the target document 61 into a 768-dimensional first embedding vector 66 that contains information on the relationships between the words in the target document 61.

In some embodiments related to step S200, the domain core keyword extraction system may preprocess the target document 61 before inputting it into the additionally trained embedding model 65. The preprocessing may be any one of normalization, case folding (treating uppercase and lowercase letters as the same), and stop word removal.
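The three preprocessing operations named above can be combined as in the following sketch. The disclosure does not specify which normalization is meant, so Unicode NFKC normalization is an assumption here, and the stop word list `STOP_WORDS` and the function name `preprocess` are hypothetical.

```python
import unicodedata

# Hypothetical stop word list; a real system would use a fuller, domain-aware list.
STOP_WORDS = {"a", "an", "the", "of", "for", "to"}


def preprocess(text, stop_words=STOP_WORDS):
    """Normalize, fold case, and drop stop words before embedding."""
    text = unicodedata.normalize("NFKC", text)  # assumed form of normalization
    text = text.lower()                         # merge uppercase and lowercase
    tokens = [t for t in text.split() if t not in stop_words]
    return " ".join(tokens)


# The first letter of "Ｇraphene" is a full-width character that NFKC maps to "G".
cleaned = preprocess("The  Application of Ｇraphene to a Biosensor")
print(cleaned)  # application graphene biosensor
```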

In step S300, referring to FIG. 6, the domain core keyword extraction system may input information on some of the keywords 62, 63, 64 extracted from the target document 61 into the additionally trained embedding model 65 to obtain second embedding vectors 67, 68, 69.

In some embodiments related to step S300, the keywords 62, 63, 64 extracted from the target document 61 may be those words in the target document 61 that appear in it with a frequency greater than or equal to a reference value.

In some embodiments related to step S300, the domain core keyword extraction system may input the single-word reference keywords 62 of the target document 61 into the additionally trained embedding model 65 to obtain second embedding vectors 67 for the single-word reference keywords.

In some other embodiments related to step S300, the domain core keyword extraction system may input the double-word reference keywords 63 of the target document 61 into the additionally trained embedding model 65 to obtain second embedding vectors 68 for the double-word reference keywords.

In still other embodiments related to step S300, the domain core keyword extraction system may input the triple-word reference keywords 64 of the target document 61 into the additionally trained embedding model 65 to obtain second embedding vectors 69 for the triple-word reference keywords.

In some embodiments of the present disclosure, the domain core keyword extraction system may measure the similarity between each of the second embedding vectors 67, 68, 69 obtained for the single-word, double-word, and triple-word reference keywords and the first embedding vector 66 obtained in step S200. This is described later.

In step S400, the domain core keyword extraction system may obtain third embedding vectors for each of the first domain reference keywords obtained in step S100. This step is described in more detail with reference to FIG. 7.

In some embodiments related to step S400, referring to FIG. 7, the domain core keyword extraction system may input the first domain reference keywords 82 extracted in advance in step S100 into the additionally trained embedding model 65 to obtain a third embedding vector 83 corresponding to each first domain reference keyword 82.

In some embodiments of the present disclosure, the domain core keyword extraction system may measure the similarity between each of the obtained third embedding vectors 83 and the first embedding vector 66 obtained in step S200. This is described later.

Next, in step S500, the domain core keyword extraction system may extract core keywords of the first domain based on the embedding vectors obtained in steps S200 to S400. This step is described in more detail with reference to FIG. 7.

In some embodiments related to step S500, the domain core keyword extraction system may measure the similarity between the first embedding vector 66 and each second embedding vector 67 of the single-word reference keywords 62 obtained in step S300, and may determine the single-word reference keyword 62 of any second embedding vector 67 whose similarity is greater than or equal to a reference value to be a single-word core keyword 91 for the target document 61. The single-word core keyword 91 may be any one of the keywords contained in the target document 61.

Here, the similarity between the second embedding vector 67 and the first embedding vector 66 may be measured using cosine similarity or Euclidean distance. The present disclosure is not limited to these, however, and any method commonly used in natural language processing to measure similarity between vectors may be used. The same applies to the other embodiments described below in which the domain core keyword extraction system measures the similarity between a plurality of embedding vectors.
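The two measures named above can be written out in a few lines. This is a minimal sketch over plain Python lists; the function names and the two-dimensional toy vectors are hypothetical stand-ins for real embedding vectors.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between u and v; 1.0 means the same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


def euclidean_distance(u, v):
    """Straight-line distance between u and v; 0.0 means identical vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


doc_vec = [1.0, 0.0]  # toy stand-in for a document embedding
kw_vec = [1.0, 1.0]   # toy stand-in for a keyword embedding

print(round(cosine_similarity(doc_vec, kw_vec), 4))  # 0.7071
print(euclidean_distance(doc_vec, kw_vec))           # 1.0
```

Note that the two measures point in opposite directions: a larger cosine similarity means more similar, while a smaller Euclidean distance means more similar, so a "similarity at or above a reference value" test based on distance has to be expressed as "distance at or below a reference value".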

In some other embodiments related to step S500, the domain core keyword extraction system may measure the similarity between the first embedding vector 66 and each second embedding vector 68 of the double-word reference keywords 63 obtained in step S300, and may determine the double-word reference keyword 63 of any second embedding vector 68 whose similarity is greater than or equal to a reference value to be a double-word core keyword 92 for the target document 61. The double-word core keyword 92 may be any one of the two-word compound expressions contained in the target document 61.

In still other embodiments related to step S500, the domain core keyword extraction system may measure the similarity between the first embedding vector 66 and each second embedding vector 69 of the triple-word reference keywords 64 obtained in step S300, and may determine the triple-word reference keyword 64 of any second embedding vector 69 whose similarity is greater than or equal to a reference value to be a triple-word core keyword 93 for the target document 61. The triple-word core keyword 93 may be any one of the three-word compound expressions contained in the target document 61.

In still other embodiments related to step S500, the domain core keyword extraction system may measure the similarity between the first embedding vector 66 and each third embedding vector 83 of the first domain reference keywords 82 obtained in step S400, and may determine the first domain reference keyword 82 of any third embedding vector 83 whose similarity is greater than or equal to a reference value to be a first domain core keyword 94 for the target document 61. The first domain core keyword 94 may be one that is contained in the target document 61.

In still other embodiments related to step S500, the domain core keyword extraction system may include the single-word core keywords 91, the double-word core keywords 92, the triple-word core keywords 93, and the first domain core keywords 94 in the first domain final core keyword group 160 shown in FIG. 7.
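The selection in step S500 can be sketched as one thresholding routine applied to each candidate group, with the survivors pooled into a final group. The sketch assumes cosine similarity as the measure; the function names, the threshold of 0.8, and all vectors are hypothetical toy values, not outputs of the embedding model described in the disclosure.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def select_core_keywords(doc_vec, candidates, threshold):
    """Keep candidates whose similarity to the document vector is >= threshold.

    candidates maps each keyword to its embedding vector.
    """
    return [kw for kw, vec in candidates.items() if cosine(doc_vec, vec) >= threshold]


doc_vec = [0.9, 0.1, 0.3]  # toy first embedding vector (target document)

# Toy second embedding vectors, grouped by n-gram length.
unigram_vecs = {"graphene": [0.8, 0.2, 0.3], "case": [0.0, 1.0, 0.0]}
bigram_vecs = {"graphene application": [0.7, 0.3, 0.2]}
trigram_vecs = {}
# Toy third embedding vectors (first domain reference keywords).
domain_vecs = {"biosensor": [0.6, 0.0, 0.5], "polymer": [0.0, 0.2, 1.0]}

final_group = []
for group in (unigram_vecs, bigram_vecs, trigram_vecs, domain_vecs):
    final_group.extend(select_core_keywords(doc_vec, group, threshold=0.8))

print(final_group)  # ['graphene', 'graphene application', 'biosensor']
```

In this toy run 'biosensor' comes from the domain reference keywords rather than from the target document, which is the case the method is designed to cover.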

Conventionally, the recommended approach was to extract the core keywords of a target document based on how often the words in the document appear. With that approach, however, meaningless core keywords that the user does not want may be extracted, and keywords that are relevant to the domain the user is interested in may be missed because they appear only rarely in the target document. According to the present embodiment, the domain core keyword extraction system can not only accurately extract core keywords related to the first domain from the target document, but can also extract core keywords of the first domain that are not contained in the target document.

Next, in step S600, the domain core keyword extraction system may refine the keywords included in the first domain final core keyword group. In some embodiments of the present disclosure, refining the keywords may mean keeping only one of a plurality of keywords that have substantially the same meaning but are expressed differently, and deleting the rest. This is described below with reference to FIG. 7.

For example, suppose the first domain final core keyword group 160 contains keywords such as 'graphene', 'graphene application case', 'graphene application case proximity', 'biosensor', and 'biosensor application case'. 'Graphene application case' and 'graphene application case proximity' merely add further meaning to 'graphene' and are substantially the same as 'graphene' as keywords, so they may be deleted. The criteria by which keywords are deleted are described later.

In some embodiments related to step S600, the domain core keyword extraction system may keep only one of a plurality of identical keywords included in the first domain final core keyword group 160 and delete the rest.

For instance, if the 'graphene' keyword among the single-word core keywords 91 extracted from the target document 61 and the 'graphene' keyword among the first domain core keywords 94 extracted from the first domain document set 81 are both included in the first domain final core keyword group 160, the domain core keyword extraction system may determine that two or more identical keywords exist and may delete the 'graphene' keyword extracted from the first domain document set 81.
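The duplicate removal in this example can be sketched as an order-preserving pass that keeps the first occurrence of each keyword. Listing the target-document keywords before the domain-set keywords, so that the domain-set copy is the one dropped, is an assumption made to match the example; the function name and the source labels are hypothetical.

```python
def drop_duplicates(keywords):
    """Keep the first occurrence of each keyword.

    keywords is a list of (keyword, source) pairs, with entries from the
    target document listed before entries from the domain document set.
    """
    seen = set()
    kept = []
    for keyword, source in keywords:
        if keyword not in seen:
            seen.add(keyword)
            kept.append((keyword, source))
    return kept


group = [
    ("graphene", "target_document"),
    ("biosensor", "target_document"),
    ("graphene", "domain_set"),  # same keyword, extracted from the domain set
]
deduped = drop_duplicates(group)
print(deduped)  # [('graphene', 'target_document'), ('biosensor', 'target_document')]
```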

In some other embodiments related to step S600, the domain core keyword extraction system may measure the similarity between the first embedding vector 66 and the embedding vector 67, 68, 69, 83 of each keyword included in the first domain final core keyword group 160, and may exclude from the first domain final core keyword group 160 any keyword whose embedding vector 67, 68, 69, 83 has a similarity at or below a reference value.

For example, suppose a 'graphene' keyword and an 'application case' keyword have been extracted from the first domain document set 81 and included in the first domain final core keyword group 160. If the embedding vector of the 'application case' keyword has a similarity to the first embedding vector 66 at or below the reference value, the domain core keyword extraction system may judge that 'application case' is not a suitable keyword to represent the target document 61 and may remove it from the first domain final core keyword group 160.

In still other embodiments related to step S600, the domain core keyword extraction system may measure the similarity between the embedding vectors 67, 68, 69, 83 of the keywords included in the first domain final core keyword group 160, and may exclude from the first domain final core keyword group 160 any keyword whose embedding vector 67, 68, 69, 83 has a similarity to another keyword's embedding vector that is greater than or equal to a reference value.

For example, suppose the first domain final core keyword group 160 contains a 'graphene' keyword extracted from the first domain document set 81, a 'graphene application case' keyword extracted from the target document 61 as a double-word core keyword 92, and a 'biosensor' keyword extracted from the target document 61 as a single-word core keyword 91. The 'graphene application case' keyword has a similarity to the 'graphene' keyword that is greater than or equal to the reference value, so it may be excluded from the first domain final core keyword group 160. The 'biosensor' keyword does not have such a similarity to any other keyword, so it may be retained.
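The pairwise refinement in this example can be sketched as a greedy pass. The disclosure does not say which member of a similar pair is dropped, so this sketch assumes that keywords are visited in order and that a keyword is dropped when it is too similar to one already kept; the function names, the threshold of 0.9, and the vectors are hypothetical toy values.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def prune_redundant(keyword_vecs, threshold):
    """Greedily keep keywords that are not too similar to any kept keyword.

    keyword_vecs maps each keyword to its embedding vector; iteration order
    decides which member of a similar pair survives (an assumption).
    """
    kept = []
    for kw, vec in keyword_vecs.items():
        if all(cosine(vec, keyword_vecs[k]) < threshold for k in kept):
            kept.append(kw)
    return kept


vecs = {
    "graphene": [1.0, 0.0, 0.1],
    "graphene application case": [0.9, 0.1, 0.2],
    "biosensor": [0.1, 1.0, 0.0],
}
kept = prune_redundant(vecs, threshold=0.9)
print(kept)  # ['graphene', 'biosensor']
```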

In still other embodiments related to step S600, the domain core keyword extraction system may assign to each embedding vector 67, 68, 69, 83 a first score, which is that vector's similarity to the first embedding vector 66, and a second score, which is that vector's similarity to the other embedding vectors 67, 68, 69, 83. The system may then exclude from the first domain final core keyword group 160 the keyword of any embedding vector 67, 68, 69, 83 for which the sum of the first score and the second score is greater than or equal to a reference value.

In still other embodiments related to step S600, when the domain core keyword extraction system refines the first domain final core keywords based on the sum of the first score and the second score, it may vary the reference value for the score sum according to a predefined limit on the number of keywords to be extracted.

For example, suppose the first domain final core keyword group 160 contains the keywords 'graphene', 'graphene application case', 'graphene application case proximity', 'biosensor', and 'biosensor application case'. When the predefined limit on the number of extracted keywords is two, the domain core keyword extraction system may delete everything except 'graphene' and 'biosensor'. When the limit is three, it may delete everything except 'graphene', 'biosensor', and 'biosensor application case'.

In still other embodiments related to step S600, the domain core keyword extraction system may perform a determinantal point process (DPP) computation on the embedding vectors 67, 68, 69, 83 and, based on the resulting probability that each embedding vector 67, 68, 69, 83 is included in the first domain final core keyword group 160, may determine whether each embedding vector 67, 68, 69, 83 is included in the first domain final core keyword group 160.

Here, the probability that each of the embedding vectors 67, 68, 69, 83 is included in the first domain final core keyword group 160 may be proportional to the value of the determinant computed for that embedding vector. The determinant for each of the embedding vectors 67, 68, 69, 83 may be calculated as follows.

For example, given a particular embedding vector S, the determinant of S may be calculated as follows. In the formula, () may denote an arbitrary (random) Fourier feature value for the particular embedding vector, and () may denote the similarity between the particular embedding vector and the first embedding vector 66.
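The formula itself is not reproduced in this text, so the following is only a sketch of one common way a determinantal point process scores a set of items, not the disclosure's own formula. It assumes the widely used quality-diversity kernel L_ij = q_i * k(i, j) * q_j, where q_i is the similarity of keyword i to the document vector and k is a similarity kernel between keyword vectors. Plain cosine similarity is used for k here in place of the Fourier features mentioned above, and every name and vector is hypothetical.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def determinant(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [row[:] for row in matrix]
    n = len(a)
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            return 0.0
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det  # a row swap flips the sign
        det *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return det


def subset_score(doc_vec, vectors):
    """Unnormalized DPP weight det(L_S) for the given keyword vectors.

    Assumed kernel: L_ij = q_i * cosine(v_i, v_j) * q_j, q_i = cosine(doc_vec, v_i).
    """
    q = [cosine(doc_vec, v) for v in vectors]
    n = len(vectors)
    kernel = [[q[i] * cosine(vectors[i], vectors[j]) * q[j] for j in range(n)]
              for i in range(n)]
    return determinant(kernel)


doc_vec = [1.0, 1.0, 0.0]
graphene = [1.0, 0.0, 0.1]
graphene_case = [0.9, 0.1, 0.2]
biosensor = [0.1, 1.0, 0.0]

redundant = subset_score(doc_vec, [graphene, graphene_case])
diverse = subset_score(doc_vec, [graphene, biosensor])
print(diverse > redundant)  # True
```

Under this kernel a set of keywords scores highly when each keyword is close to the document and the keywords differ from one another, which is consistent with the refinement goal described for step S600.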

The domain core keyword extraction method according to an embodiment of the present disclosure has been described in detail above.

FIG. 8 is a hardware configuration diagram of a domain core keyword extraction system according to some embodiments of the present disclosure. The domain core keyword extraction system 1000 shown in FIG. 8 may be, for example, the domain core keyword extraction system 1000 described with reference to FIG. 1.

The domain core keyword extraction system 1000 may include one or more processors 1100, a system bus 1600, a communication interface 1200, a memory 1400 into which a computer program 1500 executed by the processor 1100 is loaded, and a storage 1300 that stores the computer program 1500.

The processor 1100 controls the overall operation of each component of the domain core keyword extraction system 1000. The processor 1100 may perform computations for at least one application or program for executing the methods or operations according to various embodiments of the present disclosure. The memory 1400 stores various data, instructions, and/or information. The memory 1400 may load one or more computer programs 1500 from the storage 1300 in order to execute the methods or operations according to various embodiments of the present disclosure.

The bus 1600 provides communication between the components of the domain core keyword extraction system 1000.

The communication interface 1200 supports Internet communication of the domain core keyword extraction system 1000.

The storage 1300 may store one or more computer programs 1500 in a non-transitory manner.

The computer program 1500 may include one or more instructions that implement the methods or operations according to various embodiments of the present disclosure. When the computer program 1500 is loaded into the memory 1400, the processor 1100 may perform those methods or operations by executing the one or more instructions.

For example, the computer program 1500 may include instructions for performing: an operation of obtaining first domain reference keywords extracted from a document set of a first domain; an operation of obtaining a first embedding vector for a target document using a pre-trained embedding model; an operation of obtaining, using the embedding model, second embedding vectors for some of the keywords extracted from the target document; an operation of obtaining, using the embedding model, third embedding vectors for each of the first domain reference keywords; and an operation of outputting, as a core keyword of the target document for the first domain, the keyword of any embedding vector among the second and third embedding vectors whose similarity to the first embedding vector is greater than or equal to a reference value.

In some embodiments, the domain core keyword extraction system 1000 may be configured using one or more physical servers included in a server farm, based on cloud technology such as virtual machines.

Various embodiments of the present disclosure, and the effects of those embodiments, have been described so far with reference to FIGS. 1 to 8. The effects according to the technical idea of the present disclosure are not limited to those mentioned above, and other effects not mentioned will be clearly understood by those skilled in the art from the description below.

The technical idea of the present disclosure described so far may be implemented as computer-readable code on a computer-readable medium. The computer program recorded on the computer-readable recording medium may be transmitted to another computing device through a network such as the Internet and installed on that computing device, and may thereby be used on that computing device.

Although operations are shown in the drawings in a specific order, this should not be understood to mean that the operations must be performed in the specific order shown or in sequential order, or that all of the illustrated operations must be performed, in order to obtain the desired results. In certain situations, multitasking and parallel processing may be advantageous. Although embodiments of the present disclosure have been described above with reference to the accompanying drawings, those of ordinary skill in the art to which the present disclosure pertains will understand that the present invention may be implemented in other specific forms without changing its technical idea or essential features. Therefore, the embodiments described above should be understood as illustrative in all respects and not restrictive. The scope of protection of the present invention should be interpreted according to the claims below, and all technical ideas within a scope equivalent thereto should be construed as falling within the scope of rights of the technical idea defined by the present disclosure.

Claims

1. A method for extracting domain core keywords from a target document, performed by a computing device, the method comprising:
obtaining first domain reference keywords extracted from a document set of a first domain;
obtaining a first embedding vector for the target document using a pre-trained embedding model;
obtaining, using the embedding model, second embedding vectors for some of the keywords extracted from the target document;
obtaining, using the embedding model, third embedding vectors for each of the first domain reference keywords; and
outputting, as a core keyword of the target document for the first domain, a keyword of an embedding vector that, among the second embedding vectors and the third embedding vectors, has a similarity to the first embedding vector greater than or equal to a reference value.

2. The method of claim 1, wherein the target document is not included in the document set of the first domain.

3. The method of claim 1, wherein the outputting as a core keyword for the first domain comprises outputting keywords of some of the third embedding vectors as core keywords of the target document for the first domain.

4. The method of claim 1, further comprising excluding, from the core keywords for the first domain, any keyword whose similarity to the first embedding vector is less than a reference value and whose similarity to another core keyword for the first domain is greater than or equal to a reference value.
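
The exclusion step recited above can be sketched as follows. This is only an illustration under stated assumptions: the similarity values are supplied as precomputed inputs, the two reference values are separate hypothetical parameters (`doc_threshold`, `pair_threshold`), and each keyword is compared only against keywords already kept, in descending order of document similarity. The claim itself does not specify a processing order.

```python
from typing import Dict, FrozenSet, List


def drop_redundant(
    doc_sim: Dict[str, float],
    pair_sim: Dict[FrozenSet[str], float],
    doc_threshold: float,
    pair_threshold: float,
) -> List[str]:
    """Drop keywords that are both weakly related to the document and
    close to a keyword that has already been kept."""
    kept: List[str] = []
    # Assumption: visit keywords from most to least similar to the document.
    for kw in sorted(doc_sim, key=lambda k: doc_sim[k], reverse=True):
        weak = doc_sim[kw] < doc_threshold
        redundant = any(
            pair_sim.get(frozenset((kw, other)), 0.0) >= pair_threshold
            for other in kept
        )
        # Exclude only when BOTH conditions hold, as in the claim.
        if not (weak and redundant):
            kept.append(kw)
    return kept
```

A keyword that is weakly related to the document but unlike every kept keyword survives, as does a near-duplicate that is itself strongly related to the document.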

5. The method of claim 1, wherein the keywords extracted from the target document include single-word keywords, two-word keywords, and three-word keywords.
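
Candidate keywords of one, two, and three words can be generated with a simple n-gram pass, sketched below. Whitespace tokenization and lower-casing are assumptions made for illustration; the source does not state how the target document is tokenized, and `ngram_candidates` is a hypothetical helper name.

```python
from typing import Dict, List


def ngram_candidates(text: str, max_n: int = 3) -> List[str]:
    """Return unique 1- to max_n-word phrases in order of first appearance."""
    tokens = text.lower().split()  # assumption: whitespace tokenization
    seen: Dict[str, None] = {}
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            seen.setdefault(" ".join(tokens[i:i + n]), None)
    return list(seen)
```

For the input `"domain keyword extraction"` this yields three single-word candidates, two two-word candidates, and one three-word candidate.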

6. The method of claim 1, further comprising excluding some of the core keywords for the first domain from the core keywords for the first domain, based on a determinant value of each vector of the core keywords for the first domain, the determinant value being calculated as a result of performing a determinantal point process operation on each vector of the core keywords for the first domain.

7. A domain core keyword extraction system implemented as a computing system, comprising:
a network interface for receiving a plurality of documents;
a memory into which a domain core keyword extraction program is loaded; and
one or more processors that execute the domain core keyword extraction program,
wherein the domain core keyword extraction program comprises:
an instruction for obtaining first domain reference keywords extracted from a document set of a first domain;
an instruction for obtaining a first embedding vector for a target document using a pre-trained embedding model;
an instruction for obtaining, using the embedding model, second embedding vectors for some of the keywords extracted from the target document;
an instruction for obtaining, using the embedding model, third embedding vectors for each of the first domain reference keywords; and
an instruction for outputting, as a core keyword of the target document for the first domain, a keyword of an embedding vector that, among the second embedding vectors and the third embedding vectors, has a similarity to the first embedding vector greater than or equal to a reference value.

8. The domain core keyword extraction system of claim 7, wherein the target document is not included in the document set of the first domain.

9. The domain core keyword extraction system of claim 7, wherein the instruction for outputting as a core keyword for the first domain comprises an instruction for outputting keywords of some of the third embedding vectors as core keywords of the target document for the first domain.

10. The domain core keyword extraction system of claim 7, wherein the domain core keyword extraction program further comprises an instruction for excluding, from the core keywords for the first domain, any keyword whose similarity to the first embedding vector is less than a reference value and whose similarity to another core keyword for the first domain is greater than or equal to a reference value.

11. The domain core keyword extraction system of claim 7, wherein the keywords extracted from the target document include single-word keywords, two-word keywords, and three-word keywords.

12. The domain core keyword extraction system of claim 7, wherein the domain core keyword extraction program further comprises an instruction for excluding some of the core keywords for the first domain from the core keywords for the first domain, based on a determinant value of each vector of the core keywords for the first domain, the determinant value being calculated as a result of performing a determinantal point process operation on each vector of the core keywords for the first domain.