KR20220081009A

KR20220081009A - Keyword extraction apparatus, control method thereof and keyword extraction program

Info

Publication number: KR20220081009A
Application number: KR1020200170409A
Authority: KR
Inventors: 이다니엘; 고병일; 안애림; 신명철; 김응균
Original assignee: 주식회사 카카오엔터프라이즈
Priority date: 2020-12-08
Filing date: 2020-12-08
Publication date: 2022-06-15
Also published as: KR102639979B1

Abstract

The present invention relates to an apparatus and a control method for extracting main keywords from a document composed of text data. More specifically, main keywords are extracted as follows: keywords matching a predetermined morpheme pattern are extracted from a stored extraction target document as candidates; first filtering is performed on the extracted keyword candidates in consideration of at least one of grammaticality and frequency of appearance; first deduplication is performed on the first-filtered candidates in consideration of whether they share a common character string; a rank within the extraction target document is calculated for each candidate remaining after the first deduplication; second filtering is performed on those candidates based on the calculated rank; second deduplication is performed on the second-filtered candidates in consideration of semantic duplication; and the keywords to be extracted are selected from the candidates remaining after the second deduplication.

Description

Main keyword extraction device, its control method, and main keyword extraction program

The present invention relates to an apparatus, a control method, and a program for extracting main keywords, and more particularly to an apparatus, a control method, and a program for extracting keywords that can represent text data composed of a plurality of words, such as a document, and providing them to a user.

Keyword extraction means extracting the main keywords that can represent text data composed of a plurality of words, such as a document. The extracted keywords can be used to grasp the content of a document easily or to search for it. Although a single keyword representing the document is sometimes extracted, it is common to extract and provide two to five keywords of high importance so that the content of the document can be grasped easily. In keyword extraction, judging importance accurately matters so that highly important keywords are provided, but minimizing overlap among the extracted main keywords matters as well.

Existing techniques attempt to extract keywords through a variety of approaches. Commonly used examples include the 'two-step method', which generates candidates using heuristic definitions, and the 'graph-based ranking' method, which represents the associations between candidates in a document as a graph.

With the recent development of speech recognition technology, there are various applications that transcribe human speech into text. Techniques for extracting main keywords from the transcribed text are widely used to get a rough idea of what a recording contains. However, speech recognition inevitably produces misrecognized characters, and because there are multiple speakers, words with the same meaning are frequently expressed differently. For example, person A may say '휴대폰' (mobile phone) while person B says '핸드폰' (cell phone). The two are different words as character strings, but their implied meaning (what the speaker intends to say) is effectively the same. Only when such words are treated as a single word during keyword extraction can the accuracy of the extracted keywords be increased and overlap among keywords minimized.

Existing techniques for extracting main keywords from a document do not properly reflect these characteristics of speech recognition. For example, '휴대폰' and '핸드폰' may be recognized as different keywords and both provided as main keywords, and because they are treated as different words, their importance may not be calculated accurately.

Accordingly, research is needed on keyword extraction techniques optimized for documents transcribed from recognized speech.

An object of the present invention is to provide an apparatus, a control method, and a program for extracting main keywords from text data such as a document.

Another object of the present invention is to provide a keyword extraction apparatus, a control method, and a program optimized for speech transcription documents.

The technical problems to be solved by the present invention are not limited to those mentioned above, and other technical problems not mentioned will be clearly understood from the description below by those of ordinary skill in the art to which the present invention belongs.

To solve the above or other problems, one aspect of the present invention provides a control method of a keyword extraction apparatus, comprising: extracting keywords corresponding to a predetermined morpheme pattern from an extraction target document as candidates; performing first filtering on the keyword candidates in consideration of at least one of grammaticality and frequency of appearance; calculating a rank within the extraction target document for the keyword candidates; performing second filtering on the keyword candidates based on the calculated rank; and selecting the keywords to be extracted.

The method may further include performing first deduplication in consideration of whether the keyword candidates share a common character string.

Performing the first deduplication may include: clustering the keyword candidates into a plurality of groups in consideration of whether they share a common character string; and removing the keyword candidates other than the representative keyword of each of the clustered groups.

The clustering may include: calculating the longest common subsequence (LCS) between keyword candidates; and placing keyword candidates whose calculated LCS length is equal to or greater than a predetermined value in the same group.
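The LCS-based grouping described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function names, the greedy strategy of comparing each candidate against the first member of each group, and the choice of that first member as the group's representative are all hypothetical. The text only specifies that candidates whose LCS length meets a predetermined value are grouped and that non-representative candidates are removed.

```python
from typing import List


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b (row-by-row DP)."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def cluster_by_lcs(candidates: List[str], min_lcs: int) -> List[List[str]]:
    """Group candidates whose LCS with a group's first member is >= min_lcs.

    Assumption: greedy assignment to the first matching group.
    """
    groups: List[List[str]] = []
    for cand in candidates:
        for group in groups:
            if lcs_length(cand, group[0]) >= min_lcs:
                group.append(cand)
                break
        else:
            groups.append([cand])
    return groups


def remove_string_duplicates(candidates: List[str], min_lcs: int) -> List[str]:
    """Keep one representative per group (assumption: the first member)."""
    return [group[0] for group in cluster_by_lcs(candidates, min_lcs)]
```

In practice the threshold would probably need to scale with candidate length, since a fixed value treats short and long candidates very differently; the text leaves the "predetermined value" open.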

Performing the first filtering may include: calculating an importance for each keyword candidate in consideration of at least one of grammaticality and frequency of appearance; and filtering out keyword candidates whose calculated importance is low.

When grammaticality is considered, calculating the importance may include: receiving a plurality of news content items; training a grammaticality language model on the received news content; and calculating the importance based on the trained grammaticality language model.

When frequency of appearance is considered, calculating the importance may include: training an informativeness language model on the extraction target document; and calculating the importance based on the trained informativeness language model.
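As a rough illustration of a frequency-based importance score, the sketch below trains a smoothed unigram model on the tokens of the extraction target document and scores each candidate by the mean log-probability of its words. The text does not specify the type of language model, the scoring formula, or the cutoff, so the unigram model, the add-one smoothing, the `<unk>` entry, and the `keep` parameter are all assumptions made for this example. Candidates are assumed to be non-empty, whitespace-separated word sequences.

```python
import math
from collections import Counter
from typing import Dict, List


def train_unigram(tokens: List[str]) -> Dict[str, float]:
    """Add-one smoothed unigram probabilities; "<unk>" covers unseen words."""
    counts = Counter(tokens)
    denom = sum(counts.values()) + len(counts) + 1
    probs = {word: (count + 1) / denom for word, count in counts.items()}
    probs["<unk>"] = 1 / denom
    return probs


def importance(candidate: str, model: Dict[str, float]) -> float:
    """Mean log-probability of the candidate's words under the model."""
    words = candidate.split()
    return sum(math.log(model.get(w, model["<unk>"])) for w in words) / len(words)


def first_filter(candidates: List[str], model: Dict[str, float], keep: int) -> List[str]:
    """Keep the `keep` candidates with the highest importance."""
    ranked = sorted(candidates, key=lambda c: importance(c, model), reverse=True)
    return ranked[:keep]
```

A grammaticality score could be computed the same way with a model trained on a separate news corpus, and the two scores combined; how they are combined is not stated in this section.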

The method may further include performing second deduplication in consideration of semantic duplication among the keyword candidates.

In calculating the rank, the rank may be calculated in consideration of at least one of a count of a given keyword in the extraction target document and a count of the sentences that contain the given keyword.

The second filtering may be performed by removing, from among the keyword candidates, keywords whose calculated rank is low.
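The rank calculation and second filtering can be sketched as below. The text names the two inputs (how often a keyword occurs and how many sentences contain it) but not how they are combined, so the weighted sum, the weights `w_count` and `w_sent`, plain substring matching, and the `keep` cutoff are assumptions made for this example.

```python
from typing import Dict, List


def rank_scores(sentences: List[str], candidates: List[str],
                w_count: float = 1.0, w_sent: float = 1.0) -> Dict[str, float]:
    """Score each candidate from its occurrence count and sentence count."""
    scores: Dict[str, float] = {}
    for cand in candidates:
        # Total number of (non-overlapping) occurrences across all sentences.
        occurrences = sum(s.count(cand) for s in sentences)
        # Number of sentences that contain the candidate at least once.
        sentence_hits = sum(1 for s in sentences if cand in s)
        scores[cand] = w_count * occurrences + w_sent * sentence_hits
    return scores


def second_filter(scores: Dict[str, float], keep: int) -> List[str]:
    """Drop low-ranked candidates by keeping only the top `keep` scores."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return ordered[:keep]
```

Substring matching is a simplification; a real system working on Korean text would more likely match on morpheme-analyzed tokens.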

To solve the above or other problems, another aspect of the present invention provides a keyword extraction apparatus comprising: a memory that stores an extraction target document; and a control unit that extracts keywords from the stored extraction target document, wherein the control unit extracts keywords corresponding to a predetermined morpheme pattern from the extraction target document as candidates, performs first filtering on the keyword candidates in consideration of at least one of grammaticality and frequency of appearance, calculates a rank within the extraction target document for the keyword candidates, performs second filtering on the keyword candidates based on the calculated rank, and selects the keywords to be extracted.

The control unit may perform first deduplication in consideration of whether the keyword candidates share a common character string.

In performing the first deduplication, the control unit may cluster the keyword candidates into a plurality of groups in consideration of whether they share a common character string, and remove the keyword candidates other than the representative keyword of each of the clustered groups.

In performing the clustering, the control unit may calculate the longest common subsequence (LCS) between keyword candidates and place keyword candidates whose calculated LCS length is equal to or greater than a predetermined value in the same group.

In performing the first filtering, the control unit may calculate an importance for each keyword candidate in consideration of at least one of grammaticality and frequency of appearance, and filter out keyword candidates whose calculated importance is low.

In considering grammaticality, the control unit may receive a plurality of news content items, train a grammaticality language model on the received news content, and calculate the importance based on the trained grammaticality language model.

In considering frequency of appearance, the control unit may train an informativeness language model on the extraction target document and calculate the importance based on the trained informativeness language model.

The control unit may perform second deduplication in consideration of semantic duplication among the keyword candidates.

In calculating the rank, the control unit may consider at least one of a count of a given keyword in the extraction target document and a count of the sentences that contain the given keyword.

In performing the second filtering, the control unit may remove, from among the keyword candidates, keywords whose calculated rank is low.

The effects of the main keyword extraction apparatus, its control method, and the program according to the present invention are as follows.

According to at least one embodiment of the present invention, main keywords that represent a document can be extracted with high accuracy while overlap among the extracted keywords is minimized.

In addition, according to at least one embodiment of the present invention, overlap among keywords extracted from a document produced by recognizing the speech of several speakers can be minimized.

Further scope of applicability of the present invention will become apparent from the detailed description below. However, because various changes and modifications within the spirit and scope of the present invention will be clearly understood by those skilled in the art, the detailed description and specific embodiments, such as the preferred embodiments of the present invention, should be understood as given by way of example only.

FIG. 1 is a diagram illustrating a typical use case of general speech recognition technology.
FIG. 2 is a block diagram of the main keyword extraction apparatus 100 according to an embodiment of the present invention.
FIG. 3 is a control flowchart of the keyword extraction apparatus 100 according to an embodiment of the present invention.
FIG. 4 is an example of outputting keywords extracted according to an embodiment of the present invention.
FIG. 5 is a detailed flowchart of keyword extraction according to an embodiment of the present invention.
FIG. 6 is a conceptual diagram of training two types of language models according to an embodiment of the present invention.
FIG. 7 is a control flowchart for removing duplicate keywords based on a common character string (step S504) according to an embodiment of the present invention.
FIG. 8 illustrates a conceptual example of removing duplicate keywords based on a common character string.
FIG. 9 illustrates an example of calculating a second weight according to an embodiment of the present invention.
FIG. 10 is a conceptual diagram of removing keyword duplication in consideration of semantic duplication according to an embodiment of the present invention.
FIG. 11 is a conceptual diagram of prohibited-word filtering according to an embodiment of the present invention.

Hereinafter, the embodiments disclosed in this specification are described in detail with reference to the accompanying drawings. The same or similar components are given the same reference numbers regardless of the figure in which they appear, and redundant descriptions of them are omitted. The suffixes "module" and "unit" used for components in the following description are assigned or used interchangeably only for ease of drafting the specification and do not in themselves have distinct meanings or roles. In describing the embodiments disclosed in this specification, detailed descriptions of related known technologies are omitted where they would obscure the gist of the embodiments. The accompanying drawings are provided only to make the embodiments easier to understand; they do not limit the technical idea disclosed in this specification, which should be understood to include all changes, equivalents, and substitutes falling within the spirit and technical scope of the present invention.

Terms including ordinals, such as first and second, may be used to describe various components, but the components are not limited by these terms. The terms are used only to distinguish one component from another.

When a component is said to be "connected" or "coupled" to another component, it may be directly connected or coupled to that other component, but it should be understood that intervening components may also be present. In contrast, when a component is said to be "directly connected" or "directly coupled" to another component, it should be understood that no intervening components are present.

A singular expression includes the plural unless the context clearly indicates otherwise.

In this application, terms such as "comprise" or "have" are intended to specify the presence of the features, numbers, steps, operations, components, parts, or combinations thereof described in the specification, and should be understood not to preclude the presence or addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

The embodiments of the present invention described below relate to keyword extraction techniques optimized for speech transcription documents, but the invention is not limited to such documents and can also be applied to general documents. A keyword extracted by the present invention may be a single word, but it may also be a combination of two or more words (hereinafter, a combination of two or more words is called a word sequence). For example, a keyword extracted according to an embodiment of the present invention may be a combination of two words, such as '조사 결과' (investigation result), or a combination of three words, such as '수중 침투 가능성' (possibility of underwater infiltration).

FIG. 1 is a diagram illustrating a typical use case of general speech recognition technology.

When speech input is received from a plurality of speakers 111-1 to 111-3, the speech recognition device 112 may recognize the input speech and generate a speech transcription document 200 in which the speech is transcribed into text. For example, when minutes are to be produced from a recording of a multi-party meeting, the recording can be converted into the speech transcription document 200 by the speech recognition device 112. Furthermore, if the speech recognition device 112 distinguishes speakers during recognition, the speech transcribed into the speech transcription document 200 may be displayed separately for each speaker (201), as shown in FIG. 1. This is because distinguishing the speakers can be important in meeting minutes.

In a general document that is not a speech transcription document, terms with a particular meaning are written in one unified form. In speech, however, terms with the same meaning are often expressed in various ways. A single person may vary their expressions, and each person also tends to use their own. For example, person A may say '자동차' (automobile) while person B says '차' (car) or '차량' (vehicle). If keywords are extracted without accounting for this variety of expression, the reliability of the extracted result inevitably decreases. Accordingly, the embodiment of the present invention described below proposes a method for extracting main keywords with higher accuracy.

FIG. 2 is a block diagram of the main keyword extraction apparatus 100 (hereinafter, the keyword extraction apparatus) according to an embodiment of the present invention.

According to an embodiment of the present invention, the keyword extraction apparatus 100 may be embedded in, or interoperate with, a mobile phone, cellular phone, smartphone, personal computer, laptop, notebook, netbook or tablet, personal digital assistant (PDA), digital camera, game console, MP3 player, personal multimedia player (PMP), e-book reader, navigation device, disc player, set-top box, home appliance, communication device, display device, or other electronic device. The keyword extraction apparatus 100 may also be embedded in, or interoperate with, a smart home appliance, intelligent vehicle, autonomous driving device, smart home environment, smart building environment, smart office environment, smart electronic security system, and the like.

The keyword extraction apparatus 100 may include a microphone 101, a communication unit 102, a control unit 103, an output unit 104, a memory 105, a power supply unit 106, and the like. The components shown in FIG. 2 are not all essential to implementing the keyword extraction apparatus 100, so the apparatus described in this specification may have more or fewer components than those listed above.

The microphone 101 converts an external acoustic signal into electrical speech data. The processed speech data may be used in various ways depending on the function being performed (or the application being executed) on the keyword extraction apparatus 100. Various noise removal algorithms may be implemented in the microphone 101 to remove noise generated while the external acoustic signal is being received.

The control unit 103 generally controls the overall operation of the keyword extraction apparatus 100. By processing signals, data, information, and the like that are input or output through the components described above, the control unit 103 can provide appropriate information or functions to the user.

In particular, the control unit 103 according to an embodiment of the present invention may recognize speech from an audio signal input through the microphone 101 and extract main keywords from the recognized speech. Alternatively, it may extract main keywords from an extraction target document stored in the memory 105. The specific method of extracting main keywords is described in detail below.

The memory 105 stores data that supports the various functions of the keyword extraction apparatus 100. The memory 105 may store data and instructions for the operation of the keyword extraction apparatus 100, and may also store the 'extraction target document' described below.

The communication unit 102 may include one or more modules that enable wired or wireless communication between the keyword extraction apparatus 100 and a wired or wireless communication system, between the keyword extraction apparatus 100 and a mobile terminal, or between the keyword extraction apparatus 100 and an external server. The communication unit 102 may also include one or more modules that connect the keyword extraction apparatus 100 to one or more networks. The speech recognition result and/or the keyword extraction result may be transmitted to another mobile terminal or an external server through the communication unit 102.

The output unit 104 is a component for outputting the speech transcription result and/or the extracted keywords, and may include a display. The speech recognition result may be shown on the display, or it may be output as text data. In the latter case, the output unit 104 may store the text data in the memory 105 described above and provide the speech transcription result by letting the user view the stored text data.

The speech transcription result and/or the extracted keywords may also be provided through the display of a mobile terminal connected via the communication unit 102.

Under the control of the control unit 103, the power supply unit 106 receives external or internal power and supplies power to each component of the keyword extraction apparatus 100.

FIG. 3 is a control flowchart of the keyword extraction apparatus 100 according to an embodiment of the present invention.

S1100 단계에서 키워드 추출 장치(100)는 음성 입력을 수신한다. 키워드 추출 장치(100)는 자체적으로 마이크(101)를 구비하고 이를 통하여 음성 입력을 수신할 수도 있지만, 외부 이동 단말기의 마이크를 통하여 수신한 음성 입력을 통신부(102)를 통하여 전달 받는 형태로 수신할 수도 있을 것이다.In step S1100, the keyword extraction apparatus 100 receives a voice input. Although the keyword extraction device 100 has its own microphone 101 and may receive a voice input through it, the voice input received through the microphone of the external mobile terminal may be received in the form of being transmitted through the communication unit 102 . it might be

S1101 단계에서 키워드 추출 장치(100)의 제어부(103)는 수신된 음성 입력에서 음성을 인식한다. 이렇게 인식된 음성은, 복수 개의 단어로 구성되는 텍스트 데이터일 것이다. 그리고 S1102 단계에서 키워드 추출 장치(100)의 제어부(103)는 인식된 음성(즉 텍스트 데이터)에서 주요 키워드를 추출한다. 그리고 S1103 단계에서 키워드 추출 장치(100)의 제어부(103)는 출력부(104)를 통하여 상기 추출된 주요 키워드를 출력해 줄 수 있다. 추출된 주요 키워드가 제공되는 예시는 이하 도 4를 참조하여 설명한다.In step S1101, the controller 103 of the keyword extraction apparatus 100 recognizes a voice from the received voice input. The recognized voice may be text data composed of a plurality of words. And in step S1102, the control unit 103 of the keyword extraction apparatus 100 extracts a main keyword from the recognized voice (ie, text data). In step S1103 , the control unit 103 of the keyword extraction apparatus 100 may output the extracted main keyword through the output unit 104 . An example in which the extracted main keyword is provided will be described below with reference to FIG. 4 .

FIG. 4 shows an example of outputting keywords extracted according to an embodiment of the present invention. The extracted keywords in this example may be output by the keyword extraction apparatus 100 itself, or the results of the voice recognition and/or keyword extraction performed by the keyword extraction apparatus 100 may be transmitted to an external mobile terminal and displayed on that terminal.

The output screen in the illustrated example includes a voice recognition output area 402 and an extracted keyword area 401. When a voice input is received through the microphone 101 in the form of an audio signal, the control unit 103 may recognize speech from the audio signal and output the recognition result to the voice recognition output area 402. When main keywords are extracted according to an embodiment of the present invention, at least one main keyword may be output in the extracted keyword area 401 and provided to the user.

The specific method of extracting main keywords from text data (step S1102) is described below.

FIG. 5 is a detailed flowchart of keyword extraction according to an embodiment of the present invention.

In step S501, the control unit 103 performs morphological analysis on the text data from which keywords are to be extracted. The text data may consist of a plurality of words and may be the result of the voice recognition in step S1101. Hereinafter, the text data subject to keyword extraction is simply referred to as the 'extraction target document'.

In step S502, the control unit 103 extracts the word(s) matching a predetermined morpheme pattern as keyword candidates. For example, the word(s) matching the morpheme pattern 'NNG XSN NNG' may be extracted as a keyword candidate. Table 1 below lists some of the morpheme codes. In other words, the present invention stores in advance the pattern(s) of morphemes that are commonly used as keywords, and selects as candidates the keywords that have those morpheme patterns.

Morpheme code   Description           Morpheme code   Description
NNG             Common noun           IC              Interjection
NNP             Proper noun           JKS             Nominative case particle
NNB             Dependent noun        JKC             Complement case particle
NR              Numeral               JKG             Adnominal case particle
NP              Pronoun               JKO             Objective case particle
VV              Verb                  JKB             Adverbial case particle
VA              Adjective             JKV             Vocative case particle
VX              Auxiliary predicate   JKQ             Quotative case particle
VCP             Positive copula       JX              Auxiliary particle
VCN             Negative copula       JC              Conjunctive particle
MM              Determiner            EP              Pre-final ending
MAG             General adverb        EF              Final ending
MAJ             Conjunctive adverb    SN              Number
...
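The candidate extraction of step S502 can be illustrated with a minimal sketch. It assumes that a morphological analyzer has already produced (word, morpheme code) pairs; the function name `extract_candidates`, the sample tokens, and the sample patterns are hypothetical and are not taken from the specification.

```python
from typing import List, Sequence, Tuple

Token = Tuple[str, str]  # (surface form, morpheme code such as "NNG")


def extract_candidates(tokens: Sequence[Token],
                       patterns: Sequence[Tuple[str, ...]]) -> List[str]:
    """Return every token span whose code sequence equals a stored pattern."""
    candidates = []
    for pattern in patterns:
        width = len(pattern)
        for start in range(len(tokens) - width + 1):
            window = tokens[start:start + width]
            if tuple(code for _, code in window) == pattern:
                # Joining with a space is a simplification for illustration.
                candidates.append(" ".join(word for word, _ in window))
    return candidates


# Hypothetical analyzer output and hypothetical stored patterns.
tokens = [("underwater", "NNG"), ("infiltration", "NNG"),
          ("possibility", "NNG"), ("is", "VCP")]
patterns = [("NNG", "NNG"), ("NNG", "NNG", "NNG")]
candidates = extract_candidates(tokens, patterns)
```

In this sketch the stored patterns play the role of the pattern database, and overlapping spans are all kept so that later filtering and duplicate-removal steps can choose among them.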

An embodiment of the present invention proposes learning the morpheme patterns from highly reliable text data, and in particular proposes using news articles as that data. Furthermore, only the headlines of the news articles, excluding the body text, may be used. Because the most important keywords of a news body are generally placed in the headline, headlines are well suited to learning the morpheme patterns used to extract keyword candidates.

The keyword extraction apparatus 100 according to an embodiment of the present invention may organize the learned morpheme patterns into a database and store it in the memory 105.

Next, in step S503, the control unit 103 may perform first filtering on the extracted keyword candidates in consideration of at least one of grammaticality and frequency of appearance.

Filtering in consideration of grammaticality means removing, from the keyword candidates, the word(s) that are not grammatical.

Frequency of appearance may mean the frequency of appearance in the extraction target document. Word(s) that do not appear frequently may be judged to be of low importance and filtered out of the keyword candidates.

This first filtering serves to roughly screen out, in a simple way, keywords that are not suitable as keyword candidates.

In particular, an embodiment of the present invention proposes using a language model (LM) in step S503. A language model is a probability distribution over sequences of at least two consecutive words. That is, it is a model that assigns probabilities to word sequences in order to model the phenomenon of language. A language model calculates the probability that a particular word sequence is used; for example, it can be used to calculate the probability that the word sequence 'corona virus' is used.

As one example, for a word sequence consisting of a word $w_1$ followed by a word $w_2$, the language model can be expressed as in Equation 1 below.

\[ P(w_2 \mid w_1) = \frac{C(w_1 w_2)}{C(w_1)} \tag{1} \]

Here, $C(w_1)$ is the number of times the word $w_1$ is counted in the analysis source, and $C(w_1 w_2)$ is the number of times the word sequence $w_1 w_2$ is counted in the analysis source.

For example, suppose that when keywords extracted from an analysis source (for example, news content) are counted, the word 'MERS' occurs 1,346,709 times, the word sequence 'MERS incubation period' occurs 1,038 times, and the word sequence 'MERS symptom' occurs 8,018 times. The probabilities of 'MERS incubation period' and 'MERS symptom' under the language model are then calculated as in Equations 2 and 3 below.

\[ P(\text{incubation period} \mid \text{MERS}) = \frac{1{,}038}{1{,}346{,}709} \approx 0.00077 \tag{2} \]

\[ P(\text{symptom} \mid \text{MERS}) = \frac{8{,}018}{1{,}346{,}709} \approx 0.00595 \tag{3} \]

That is, the language model gives the ratio of the number of times the word 'A' is used as part of the word sequence 'A+B' to the number of times the word 'A' is used.
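The ratio described above can be checked with a short sketch that uses the counts quoted in the MERS example. The function name `sequence_probability` is hypothetical, and returning 0.0 for an unseen first word is an assumption made only to keep the sketch safe to run.

```python
def sequence_probability(count_first: int, count_sequence: int) -> float:
    """Ratio of the count of the sequence 'A+B' to the count of the word 'A'."""
    if count_first == 0:
        return 0.0  # assumption: an unseen first word yields probability 0
    return count_sequence / count_first


# Counts quoted in the example above.
count_mers = 1_346_709
p_incubation = sequence_probability(count_mers, 1_038)
p_symptom = sequence_probability(count_mers, 8_018)
```

With these counts, 'MERS symptom' receives a probability roughly eight times that of 'MERS incubation period', which is the kind of relative comparison the first filtering relies on.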

In particular, in applying a language model according to an embodiment of the present invention, it is proposed to use two types of language model in combination. One is a 'grammaticality language model', trained on a highly reliable source so as to focus on grammatical correctness; the other is an 'informativeness language model', which reflects the content of the extraction target document. The two types of language model are described with reference to FIG. 6.

FIG. 6 is a conceptual diagram of the training of the two types of language model according to an embodiment of the present invention.

Because the content of the extraction target document may itself be a text transcription of a user's voice input, its grammatical reliability may not be high. Even for a document written in the ordinary way, high grammatical reliability cannot be expected if the document has no public credibility (for example, simple meeting minutes or a brief memo). News content, on the other hand, can be expected to be highly reliable, and the body of a news article even more so.

For this reason, an embodiment of the present invention proposes using a 'grammaticality language model 603' trained on news content 601 to analyze the grammaticality of a word sequence (that is, the analysis source of this language model is news content). Furthermore, in an embodiment of the present invention, the grammaticality language model 603 may be trained on the news bodies alone among the news content 601. When the probability of a particular word sequence is calculated with the grammaticality language model 603 trained in this way, the calculated probability can be expected to be grammatically reliable.

To this end, the keyword extraction apparatus 100 according to an embodiment of the present invention may extract n-grams (unigrams, bigrams, and so on) from a plurality of news content items 601 and train the grammaticality language model 603 on the extracted n-grams.

The 'informativeness language model 604' is a model that reflects the content of the extraction target document 602. Its purpose is to calculate the probability of how often a particular word sequence is used within the extraction target document 602. That is, the analysis source of the informativeness language model 604 is the extraction target document 602.

To this end, the keyword extraction apparatus 100 according to an embodiment of the present invention may extract n-grams from the extraction target document 602 and train the informativeness language model 604 on them. Put simply, the informativeness language model 604 is a model trained on the extraction target document 602.

The keyword extraction apparatus 100 according to an embodiment of the present invention may perform the first filtering (step S503) by considering the grammaticality language model 603 and the informativeness language model 604 together. Considering the two models together makes it possible to filter on both grammaticality and frequency of appearance. To use the two models together, an embodiment of the present invention proposes using pointwise KL (Kullback-Leibler) divergence. However, the use of pointwise KL divergence is only one example, and the present invention is not limited to this method.
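One possible reading of how two language models could be combined through pointwise KL divergence is sketched below. The specification does not give the formula or say which model plays which role, so the following are all assumptions: the score is the single-term form p * log(p / q), p comes from the informativeness language model, q comes from the grammaticality language model, and candidates are kept when the score exceeds a threshold. The function names and the sample probabilities are hypothetical.

```python
import math
from typing import Dict, List


def pointwise_kl(p: float, q: float) -> float:
    """Single-term (pointwise) KL contribution p * log(p / q)."""
    if p <= 0.0 or q <= 0.0:
        return 0.0  # assumption: sequences unseen by either model score 0
    return p * math.log(p / q)


def first_filter(candidates: List[str],
                 p_informative: Dict[str, float],
                 p_grammatical: Dict[str, float],
                 threshold: float = 0.0) -> List[str]:
    """Keep candidates whose pointwise KL score exceeds the threshold."""
    kept = []
    for cand in candidates:
        score = pointwise_kl(p_informative.get(cand, 0.0),
                             p_grammatical.get(cand, 0.0))
        if score > threshold:
            kept.append(cand)
    return kept


# Made-up probabilities, for illustration only.
p_info = {"MERS symptom": 0.02, "on the other": 0.001}
p_gram = {"MERS symptom": 0.005, "on the other": 0.004}
kept = first_filter(["MERS symptom", "on the other"], p_info, p_gram)
```

Under these assumptions a candidate scores high when it is markedly more probable in the extraction target document than in the news-trained model; other ways of combining the two models are equally compatible with the text.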

Returning to FIG. 5, in step S504 the control unit 103 removes duplicate keywords from the result of the first filtering on the basis of shared character strings. A control method for removing duplicate keywords on the basis of shared character strings is described in more detail below with reference to FIGS. 7 and 8.

FIG. 7 is a control flowchart for removing duplicate keywords on the basis of shared character strings (step S504) according to an embodiment of the present invention. FIG. 8 shows a conceptual example of removing duplicate keywords on the basis of shared character strings. The description refers to FIGS. 7 and 8 together.

Clustering is performed on the result of the first filtering on the basis of shared character strings (S504-1). A shared character string is a character string that the two keywords being compared have in common. The purpose is to place keywords that share a character string into a single group.

More specifically, the present invention may use the LCS (Longest Common Subsequence) algorithm to determine whether a shared character string exists and how long it is. The control unit 103 may calculate the LCS between the first-filtered candidates and place them in a single group when the length of the calculated LCS is equal to or greater than a predetermined value. For efficient iteration, the minimum LCS value may start at 3 and be increased by 1 each time the number of clustering rounds increases.

In the example shown in FIG. 8, keyword candidates 801 including 'Christmas', 'Christmas holiday', and 'Christmas event' have been extracted. The keyword candidates 801 may be the result of the filtering in step S503. Because the three keyword candidates 'Christmas', 'Christmas holiday', and 'Christmas event' all contain the character string 'Christmas', they may be grouped into a first group 802-1. Likewise, the keywords that all contain the character string 'Android' may be grouped into a second group 802-2.

The first duplicate removal (step S504) may then be carried out by removing, from each group, every keyword other than the group's representative keyword (step S504-2). In the first group 802-1 of the example in FIG. 8, 'Christmas' is selected as the representative keyword, and the remaining 'Christmas holiday' and 'Christmas event' are removed.

To select the representative keyword within a group, the keyword extraction apparatus 100 according to an embodiment of the present invention may consider the grammaticality language model 603 and the informativeness language model 604 described above together. For example, a probability score may be calculated from the two language models through pointwise KL (Kullback-Leibler) divergence, and the keyword with the highest probability score may be selected as the representative keyword. The representative keyword is thus a word that is both grammatically reliable and frequently used in the extraction target document.
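The first duplicate removal described above can be sketched as follows. The LCS length is computed with the standard dynamic-programming recurrence; the grouping strategy (each candidate joins the first group whose first member it matches), the function names, and the sample scores are assumptions made for illustration, and the score function stands in for whatever language-model score is actually used.

```python
from typing import Callable, List


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of two strings."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def group_by_lcs(candidates: List[str], min_lcs: int = 3) -> List[List[str]]:
    """Greedy grouping against the first member of each existing group."""
    groups: List[List[str]] = []
    for cand in candidates:
        for group in groups:
            if lcs_length(group[0], cand) >= min_lcs:
                group.append(cand)
                break
        else:
            groups.append([cand])  # no group matched, so start a new one
    return groups


def remove_string_duplicates(candidates: List[str],
                             score: Callable[[str], float],
                             min_lcs: int = 3) -> List[str]:
    """Keep only the highest-scoring keyword of each group."""
    return [max(group, key=score) for group in group_by_lcs(candidates, min_lcs)]


# Hypothetical scores, for illustration only.
scores = {"Christmas": 0.9, "Christmas holiday": 0.4, "Christmas event": 0.3,
          "Android": 0.7, "Android phone": 0.2}
groups = group_by_lcs(list(scores))
representatives = remove_string_duplicates(list(scores), scores.get)
```

Raising `min_lcs` on each clustering round, as the text suggests, would make later rounds progressively stricter about what counts as a shared character string.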

Returning to FIG. 5, a rank may be calculated for each of the keyword candidates remaining after the duplicate removal in step S504 (step S505). The rank is determined on the basis of the weight that the keyword has in the extraction target document, and may represent the importance of the keyword in that document. That is, the weight is an importance score used to determine the relative ranking among the keyword candidates.

The keyword extraction apparatus 100 according to an embodiment of the present invention proposes obtaining a total weight by combining weights calculated through several different approaches.

* First weight (NP_weight), considering the frequency of appearance

The first weight is calculated simply through a relative comparison of frequencies of appearance, and may be calculated by Equation 4 below.

\[ NP_{weight}(keyword_i) = \frac{\mathrm{count}(keyword_i)}{\sum_{j=1}^{n} \mathrm{count}(keyword_j)} \tag{4} \]

Here, $keyword_i$ is the keyword whose weight is calculated, $n$ is the total number of keyword candidates (that is, the candidates to be ranked), and $\sum_{j=1}^{n} \mathrm{count}(keyword_j)$ is the sum of the counts of all keywords in the extraction target document.
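A minimal sketch of the first weight follows, reading it as each candidate's share of the summed counts of all candidates. The function name `np_weights` and the sample counts are hypothetical.

```python
from typing import Dict


def np_weights(counts: Dict[str, int]) -> Dict[str, float]:
    """Each candidate's count divided by the sum of all candidates' counts."""
    total = sum(counts.values())
    if total == 0:
        return {keyword: 0.0 for keyword in counts}
    return {keyword: count / total for keyword, count in counts.items()}


# Hypothetical counts in an extraction target document.
weights = np_weights({"result": 3, "model": 1})
```

Because the weights are shares of one total, they sum to 1 across the candidates, which keeps the first weight on a comparable scale from document to document.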

* Second weight (TF-IDF_weight), considering the importance of both sentences and keywords

TF-IDF (Term Frequency - Inverse Document Frequency) is a weight used in information retrieval and text mining. It is a statistical value that indicates how important a word is within a particular document when there is a collection of several documents. It can be used, for example, to extract the key terms of a document, to rank the results of a search engine, or to measure how similar documents are to one another.

FIG. 9 shows an example of calculating the second weight according to an embodiment of the present invention.

When the extraction target document consists of several sentences, the second weight takes into account both the count of the keyword in the whole extraction target document and the count of the sentences that contain the keyword.

The second weight may be calculated as in Equation 5 below.

Here, keyword_i is the keyword whose weight is calculated, sentence_i is a sentence that contains that keyword, TF (Term Frequency) is the count of the particular keyword (the keyword whose weight is calculated) in the extraction target document, and IDF (Inverse Document Frequency) is the logarithm of the number of sentences containing the particular keyword divided by the total number of sentences.

Referring to the illustrated example, TF and IDF are calculated for each keyword candidate, and the second weight is obtained by multiplying TF by IDF. Taking the keyword 'result' in FIG. 9 as an example, counting the keyword 'result' over the whole document gives 3, so TF is 3. The number of sentences containing the keyword 'result' is 2 and the total number of sentences is 3, so IDF is log(2/3).
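The second weight can be sketched by following the definitions above literally: TF is the keyword count over the whole document, and IDF is the logarithm of (sentences containing the keyword / total sentences). The function name `tf_idf_weight`, the use of the natural logarithm, simple substring counting, and the sample sentences are assumptions; the sentences are made up so that 'result' occurs 3 times across 2 of 3 sentences, matching the counts quoted for FIG. 9.

```python
import math
from typing import List, Tuple


def tf_idf_weight(keyword: str, sentences: List[str]) -> Tuple[int, float, float]:
    """Return (TF, IDF, TF * IDF) using the definitions given in the text."""
    tf = sum(sentence.count(keyword) for sentence in sentences)
    containing = sum(1 for sentence in sentences if keyword in sentence)
    if containing == 0:
        return 0, 0.0, 0.0  # assumption: an absent keyword gets weight 0
    # Literal reading of the text: log(sentences with keyword / all sentences).
    idf = math.log(containing / len(sentences))
    return tf, idf, tf * idf


# Made-up sentences, for illustration only.
sentences = ["the result confirms the earlier result",
             "the result was published",
             "no further data"]
tf, idf, weight = tf_idf_weight("result", sentences)
```

Read literally, this ratio is at most 1, so the logarithm is never positive; the conventional IDF uses the reciprocal ratio, log(total sentences / sentences containing the keyword), which differs only in sign.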

* Third weight (Sentence_weight), considering the similarity between sentences

The third weight is a weight for the sentence that contains the keyword, and represents the importance of that sentence within the extraction target document. The similarity between sentences can be reflected by using the TextRank algorithm. The third weight may be calculated by Equation 6 below.

Because the TextRank algorithm is widely known, a more detailed description of it is omitted.

To take the first to third weights into account together, a total weight may be determined as in Equation 7 below.

Here, a, b, and c are constants that are set by experimentally selecting the values that give the highest accuracy. For example, they may be set to a=0.5, b=0.25, and c=0.25.
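A minimal sketch of combining the three weights follows. It assumes that the total weight is a linear combination and that a, b, and c multiply the first, second, and third weights in that order; neither assumption is confirmed by the surviving text. The function name `total_weight` and the sample weight values are hypothetical.

```python
def total_weight(np_weight: float,
                 tf_idf_weight: float,
                 sentence_weight: float,
                 a: float = 0.5,
                 b: float = 0.25,
                 c: float = 0.25) -> float:
    """Assumed linear combination of the first to third weights."""
    return a * np_weight + b * tf_idf_weight + c * sentence_weight


# Made-up weights for two candidates, ranked by the combined score.
candidate_weights = {"result": (0.4, 0.2, 0.8), "model": (0.1, 0.6, 0.2)}
ranked = sorted(candidate_weights,
                key=lambda kw: total_weight(*candidate_weights[kw]),
                reverse=True)
```

Sorting the candidates by this combined score gives one way to obtain the rank used in step S505.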

That is, the keyword extraction apparatus 100 may use the total weight of Equation 7 to calculate the rank in step S505.

Returning to the flowchart of FIG. 5, in step S506 second filtering is performed on the basis of the calculated rank, in order to exclude the keyword candidates with a low rank. The second filtering thus excludes the keyword candidates of low importance.

When the number of keywords to be finally selected is N, the keyword extraction apparatus 100 according to an embodiment of the present invention may perform the second filtering so as to keep N*(N-1) keywords and exclude the rest. This is because semantic duplicates must then be removed from the remaining keyword candidates, and checking for semantic duplication is computationally expensive. In other words, the keyword candidates of low importance are removed in the second filtering step to keep the amount of computation manageable.
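The second filtering can be sketched as a simple top-k cut, where k = N*(N-1). The function name `second_filter` and the sample scores are hypothetical; the scores stand in for the rank values from step S505.

```python
from typing import Dict, List


def second_filter(scores: Dict[str, float], n_final: int) -> List[str]:
    """Keep the N*(N-1) highest-ranked candidates, highest first."""
    keep = n_final * (n_final - 1)  # note: this is 0 when n_final is 1
    ordered = sorted(scores, key=scores.get, reverse=True)
    return ordered[:keep]


# Made-up rank scores for eight candidates.
scores = {f"kw{i}": 1.0 - 0.1 * i for i in range(8)}
survivors = second_filter(scores, n_final=3)
```

With N = 3, only 6 candidates reach the semantic duplicate check, so at most 15 pairwise comparisons are needed regardless of how many candidates were extracted.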

Next, in step S507, duplicate keywords are removed in consideration of the semantic overlap among the remaining keyword candidates. That is, among keywords that duplicate one another, only the representative keyword is kept and the others are removed.

As described above, when a voice transcription document has several speakers, it frequently happens that the same meaning is expressed in different ways. Semantic duplicates need to be removed not only from voice transcription documents but also from ordinary documents. An embodiment of the present invention therefore proposes determining whether keywords are semantically identical or similar (that is, duplicates) and excluding the duplicated keywords from the candidates according to the result.

To this end, an embodiment of the present invention proposes using embedding vectors learned from a large number of documents (for example, sent2vec). Because the embedding is learned from a large number of documents, an embedding vector calculated with it can carry the meaning of a keyword. After an embedding vector has been calculated for each keyword candidate, two candidates whose embedding vectors are highly similar may be judged to be semantic duplicates.

As an example, the similarity between two keyword candidates may be calculated as the cosine similarity.

FIG. 10 is a conceptual diagram of removing duplicate keywords in consideration of semantic overlap according to an embodiment of the present invention.

An embedding vector (1002-1 to 1002-3) is calculated for each of three keywords (1001-1 to 1001-3). The calculated embedding vectors (1002-1 to 1002-3) carry the meaning of the respective keywords. The similarity between the calculated embedding vectors (1002-1 to 1002-3) is then calculated, and when the calculated similarity is equal to or greater than a preset value, the two keywords may be judged to have the same meaning. For example, the cosine similarity (1003-1 to 1003-3) is calculated for the two keywords being compared, and they may be judged to have the same meaning when the calculated similarity (1003-1 to 1003-3) is 0.6 or higher.

In the example of FIG. 10, a first cosine similarity 1003-1 calculated between the first embedding vector 1002-1 for 'layoff' (정리해고, 1001-1) and the second embedding vector 1002-2 for 'Ssangyong Motor' (쌍용 자동차, 1001-2) comes to 0.57. Because the first cosine similarity 1003-1 is lower than 0.6, the two keywords 'layoff' (1001-1) and 'Ssangyong Motor' (1001-2) may be judged to be keywords with different meanings (that is, not semantic duplicates).

Likewise, repeating the same process for 'Ssangyong Motor' (쌍용 자동차, 1001-2) and its abbreviated form 'Ssangyong Car' (쌍용차, 1001-3) yields a third cosine similarity 1003-3 of 0.86. Because this value is higher than 0.6, 'Ssangyong Motor' (1001-2) and 'Ssangyong Car' (1001-3) are judged to be keywords carrying the same meaning, and one of the two may be removed.
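The semantic duplicate removal can be sketched with plain cosine similarity and the 0.6 threshold quoted above. The three-dimensional vectors below are made up and do not reproduce the 0.57 and 0.86 values of FIG. 10; in practice the vectors would come from a trained embedding model. Keeping the higher-ranked keyword of a duplicate pair is also an assumption, since the text only says that one of the two may be removed. The function names are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def remove_semantic_duplicates(ranked: List[str],
                               embeddings: Dict[str, Sequence[float]],
                               threshold: float = 0.6) -> List[str]:
    """Walk candidates from highest rank down, dropping near-duplicates."""
    kept: List[str] = []
    for keyword in ranked:
        if all(cosine_similarity(embeddings[keyword], embeddings[k]) < threshold
               for k in kept):
            kept.append(keyword)
    return kept


# Made-up embedding vectors, for illustration only.
embeddings = {"layoff": [1.0, 0.0, 0.2],
              "Ssangyong Motor": [0.3, 1.0, 0.1],
              "Ssangyong Car": [0.25, 0.9, 0.2]}
kept = remove_semantic_duplicates(
    ["layoff", "Ssangyong Motor", "Ssangyong Car"], embeddings)
```

Because every surviving keyword is compared against each newly considered one, the cost grows quadratically with the number of candidates, which is why the second filtering limits the candidates beforehand.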

Finally, the keyword extraction apparatus 100 may select all or some of the keyword candidates remaining after steps S501 to S507 as the main keywords of the extraction target document.

Meanwhile, the keyword extraction apparatus 100 according to an embodiment of the present invention may perform a forbidden-word filtering step together with steps S501 to S507. Forbidden-word filtering means excluding not only profane or ungrammatical keywords but also keywords that are unlikely to be chosen as keywords.

FIG. 11 is a conceptual diagram of forbidden-word filtering according to an embodiment of the present invention.

To this end, a forbidden-word dictionary 1101 according to an embodiment of the present invention may include at least one of a morphological analyzer dictionary 1101-1, an abstract expression dictionary 1101-2, and a user-defined forbidden-word dictionary 1101-3. The keyword extraction apparatus 100 according to an embodiment of the present invention may determine whether any of the keyword candidates is included in the forbidden-word dictionary 1101 and, if so, perform forbidden-word filtering to remove that keyword from the candidates.

The morpheme analyzer dictionary (1101-1) is a database that stores adverbs and ambiguous expressions. For example, it may store expressions such as '한편' ('meanwhile') or '각각' ('each'). Because such expressions are not suitable for extraction as keywords, it is desirable to exclude them from the keyword candidates.

The abstract expression dictionary (1101-2) is a database that collects abstract expressions. For example, it stores expressions such as '이날' ('this day'), '이때' ('at this time'), '내년도' ('next year'), and '나머지' ('the rest'). Such expressions may be drawn from the 'Sejong Corpus'.

Other expressions that are unsuitable for selection as keywords may be stored in the user-defined forbidden-word dictionary (1101-3). Examples include expressions such as '근데' ('but') or '이제' ('now').

This forbidden-word filtering step may be performed before or after the main keyword extraction step (S1102), but is preferably performed on the keyword candidates extracted in step S502.
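As a non-limiting illustration, the forbidden-word filtering described above may be sketched as follows. The dictionary contents are the example expressions given in the description; the function names, the use of in-memory sets in place of databases, and the exact-match comparison are assumptions made for illustration.

```python
# Example entries taken from the description; real dictionaries would be larger.
MORPHEME_ANALYZER_DICT = {"한편", "각각"}               # adverbs, ambiguous expressions
ABSTRACT_EXPRESSION_DICT = {"이날", "이때", "내년도", "나머지"}  # abstract expressions
USER_DEFINED_DICT = {"근데", "이제"}                    # user-defined forbidden words

def build_forbidden_dictionary(*dictionaries):
    """Merge at least one of the component dictionaries into a single set."""
    forbidden = set()
    for dictionary in dictionaries:
        forbidden |= set(dictionary)
    return forbidden

def filter_forbidden(candidates, forbidden):
    """Remove every candidate found in the forbidden-word dictionary.

    Assumption: a candidate is removed only on an exact string match.
    """
    return [cand for cand in candidates if cand not in forbidden]

forbidden_words = build_forbidden_dictionary(
    MORPHEME_ANALYZER_DICT, ABSTRACT_EXPRESSION_DICT, USER_DEFINED_DICT
)
filtered = filter_forbidden(
    ["쌍용차", "한편", "수중 침투 가능성", "이날"], forbidden_words
)
print(filtered)
```

Because the component dictionaries are merged before filtering, any subset of the three may be supplied, consistent with the forbidden-word dictionary including "at least one" of them.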

Furthermore, the present invention proposes treating keywords extracted from the same position as a single keyword in order to minimize duplication among keywords.

FIG. 12 is a conceptual diagram of a deduplication method that takes the extraction position into account, according to an embodiment of the present invention.

Referring to the illustrated example, keyword candidates are extracted from a given sentence (1200) of the 'extraction target document'. The extracted keyword candidates (1201-1 to 1201-3) may be words in the given sentence (1200) that correspond to the preset morpheme patterns of Table 2. Here, the first to third keyword candidates (1201-1 to 1201-3) are extracted from the word sequence '수중 침투 가능성' ('possibility of underwater penetration') contained in the given sentence (1200).

Preset morpheme pattern    Extracted keyword candidate
NNG NNG                    수중 침투 (underwater penetration)
NNG NNG NNG                수중 침투 가능성 (possibility of underwater penetration)
NNG NNG                    침투 가능성 (possibility of penetration)

When the keyword candidates include a plurality of word sequences that contain the word at a specific position, those word sequences are highly likely to be duplicate keywords, and it is therefore proposed to treat them as one. The first to third keyword candidates (1201-1 to 1201-3) extracted in the example of FIG. 12 are a plurality of word sequences that contain the word '침투' ('penetration') at a specific position, so there is no need to regard each of them as a separate keyword candidate. For example, only the second keyword (1201-2, '수중 침투 가능성') may be left among the keyword candidates, and the remaining first and third keywords (1201-1, 1201-3) may be excluded from the keyword candidates.
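As a non-limiting illustration, the position-based deduplication described above may be sketched as follows. The sketch assumes that each keyword candidate is recorded together with the token positions it covers in the sentence; the `Candidate` type, the half-open span convention, the transitive grouping of overlapping spans, and the rule of keeping the longest candidate in each group are assumptions made for illustration. The candidate '해군 기지' is a made-up entry added only to show a non-overlapping case.

```python
from typing import List, NamedTuple

class Candidate(NamedTuple):
    text: str
    start: int  # index of the first token covered (inclusive)
    end: int    # index just past the last token covered (exclusive)

def dedupe_by_position(candidates: List[Candidate]) -> List[Candidate]:
    """Group candidates whose token spans overlap and keep one per group.

    Assumption: the longest candidate of each group is the one retained,
    mirroring the example in which only the three-word candidate remains.
    """
    ordered = sorted(candidates, key=lambda c: (c.start, c.end))
    groups: List[List[Candidate]] = []
    group_end = -1
    for cand in ordered:
        if groups and cand.start < group_end:
            # Shares at least one token position with the current group.
            groups[-1].append(cand)
            group_end = max(group_end, cand.end)
        else:
            groups.append([cand])
            group_end = cand.end
    return [max(group, key=lambda c: c.end - c.start) for group in groups]

# Tokens 0..2 of the sentence are assumed to be: 수중, 침투, 가능성.
extracted = [
    Candidate("수중 침투", 0, 2),
    Candidate("수중 침투 가능성", 0, 3),
    Candidate("침투 가능성", 1, 3),
    Candidate("해군 기지", 5, 7),  # hypothetical candidate elsewhere in the sentence
]
survivors = [c.text for c in dedupe_by_position(extracted)]
print(survivors)
```

Grouping by overlapping spans is a simplification of "containing the word at a specific position"; a stricter variant could group only the candidates that all cover one chosen token index.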

Embodiments of the main keyword extraction device and its control method according to the present invention have been described above, but they are described as at least one embodiment only. The technical idea of the present invention and its configuration and operation are not limited thereby, and the scope of the technical idea of the present invention is not limited by the drawings or by the description made with reference to the drawings. In addition, the concepts and embodiments presented herein may be used by those of ordinary skill in the art to which the present invention belongs as a basis for modifying or designing other structures that carry out the same purposes of the present invention. Equivalent structures so modified or changed by those of ordinary skill in the art remain bound by the technical scope of the present invention as described in the claims, and various changes, substitutions, and alterations are possible without departing from the spirit or scope of the invention described in the claims.

Claims

A control method of a device for extracting keywords from an extraction target document, the method comprising:
extracting, as keyword candidates, keywords corresponding to a predetermined morpheme pattern from the extraction target document;
performing first filtering on the keyword candidates in consideration of at least one of grammaticality and frequency of appearance;
calculating, for each keyword candidate, a rank in the extraction target document;
performing second filtering on the keyword candidates based on the calculated rank; and
selecting keywords to be extracted.

The method of claim 1, further comprising performing first deduplication in consideration of whether the keyword candidates share a common character string.

The method of claim 2, wherein performing the first deduplication comprises:
clustering the keyword candidates into a plurality of groups in consideration of whether the keyword candidates share a common character string; and
removing, from each of the plurality of clustered groups, the keyword candidates other than a representative keyword of that group.

The method of claim 3, wherein the clustering comprises:
calculating a Longest Common Subsequence (LCS) between the keyword candidates; and
placing keyword candidates whose calculated LCS length is greater than or equal to a predetermined value into the same group.

The method of claim 1, wherein performing the first filtering comprises:
calculating an importance of each keyword candidate in consideration of at least one of grammaticality and frequency of appearance; and
filtering out keyword candidates whose calculated importance is low.

The method of claim 5, wherein calculating the importance comprises:
receiving a plurality of news contents;
training a grammatical language model based on the received news contents; and
calculating the importance based on the trained grammatical language model.

The method of claim 5, wherein calculating the importance comprises:
training an informational language model based on the extraction target document; and
calculating the importance based on the trained informational language model.

The method of claim 1, further comprising performing second deduplication in consideration of semantic duplication between the keyword candidates.

The method of claim 1, wherein, in calculating the rank, the rank is calculated in consideration of at least one of a count value of a given keyword and a count value of sentences containing the given keyword in the extraction target document.

The method of claim 9, wherein performing the second filtering comprises removing, from among the keyword candidates, keywords whose calculated rank is low.

A keyword extraction device comprising:
a memory for storing an extraction target document; and
a control unit for extracting keywords from the stored extraction target document,
wherein the control unit:
extracts, as keyword candidates, keywords corresponding to a predetermined morpheme pattern from the extraction target document;
performs first filtering on the keyword candidates in consideration of at least one of grammaticality and frequency of appearance;
calculates, for each keyword candidate, a rank in the extraction target document;
performs second filtering on the keyword candidates based on the calculated rank; and
selects keywords to be extracted.

The keyword extraction device of claim 11, wherein the control unit performs first deduplication in consideration of whether the keyword candidates share a common character string.

The keyword extraction device of claim 12, wherein, when performing the first deduplication, the control unit:
clusters the keyword candidates into a plurality of groups in consideration of whether the keyword candidates share a common character string; and
removes, from each of the plurality of clustered groups, the keyword candidates other than a representative keyword of that group.

The keyword extraction device of claim 13, wherein, when clustering, the control unit:
calculates a Longest Common Subsequence (LCS) between the keyword candidates; and
places keyword candidates whose calculated LCS length is greater than or equal to a predetermined value into the same group.

The keyword extraction device of claim 11, wherein, when performing the first filtering, the control unit:
calculates an importance of each keyword candidate in consideration of at least one of grammaticality and frequency of appearance; and
filters out keyword candidates whose calculated importance is low.

The keyword extraction device of claim 15, wherein, when considering grammaticality, the control unit:
receives a plurality of news contents;
trains a grammatical language model based on the received news contents; and
calculates the importance based on the trained grammatical language model.

The keyword extraction device of claim 15, wherein, when considering frequency of appearance, the control unit:
trains an informational language model based on the extraction target document; and
calculates the importance based on the trained informational language model.

The keyword extraction device of claim 11, wherein the control unit performs second deduplication in consideration of semantic duplication between the keyword candidates.

The keyword extraction device of claim 11, wherein, when calculating the rank, the control unit calculates the rank in consideration of at least one of a count value of a given keyword and a count value of sentences containing the given keyword in the extraction target document.

The keyword extraction device of claim 19, wherein, when performing the second filtering, the control unit removes, from among the keyword candidates, keywords whose calculated rank is low.

A computer program stored in a medium for executing, in combination with hardware, the method of any one of claims 1 to 10.