KR102476383B1

KR102476383B1 - A method for extracting keywords from texts based on deep learning

Info

Publication number: KR102476383B1
Application number: KR1020200056390A
Authority: KR
Inventors: 최원익; 김명수; 이상원
Original assignee: 인하대학교 산학협력단
Priority date: 2020-05-12
Filing date: 2020-05-12
Publication date: 2022-12-09
Anticipated expiration: 2040-05-12
Also published as: KR20210138266A

Abstract

A deep learning-based keyword extraction method and apparatus are disclosed. According to an embodiment, a keyword extraction method performed by a computer-implemented keyword extraction apparatus may include: performing embedding for natural language processing on text data; and extracting keywords by training a deep learning-based keyword extraction model, configured for named entity recognition (NER), using the embedded text data.

Description

Method and apparatus for extracting keywords based on deep learning {A METHOD FOR EXTRACTING KEYWORDS FROM TEXTS BASED ON DEEP LEARNING}

The following description relates to a keyword extraction method and apparatus.

Population aging has recently become a serious issue. Places that older adults visit frequently, such as welfare centers and hospitals, send schedules and other information by text message for convenience. However, many older adults are not comfortable with new technology and have difficulty making use of these messages.

Accordingly, there is a need for a technique that extracts keywords from text messages in order to provide more convenient functions for older adults.

A deep learning-based keyword extraction method and apparatus may be provided. Specifically, a method and apparatus may be provided that perform embedding for natural language processing on text data and extract keywords by training a deep learning-based keyword extraction model, configured for named entity recognition (NER), using the embedded text data.

A keyword extraction method performed by a computer-implemented keyword extraction device may include: performing embedding for natural language processing on text data; and extracting keywords by training a deep learning-based keyword extraction model, configured for named entity recognition (NER), using the embedded text data.

The text data may be a text message, and the performing of the embedding may include dividing the text message into words or sentences and vectorizing the divided words or sentences.

The performing of the embedding may include performing embedding using Word2Vec, which includes a Continuous Bag of Words (CBOW) method and a Skip-Gram method.

The performing of the embedding may include applying CBOW (Continuous Bag of Words) embedding to the text message, predicting center words from the words surrounding each word in the text message, and vectorizing the predicted center words.

The extracting of the keywords may include training the keyword extraction model using, as input, the data converted into similarity values by the Word2Vec embedding, and outputting, as a result of the training, labeling values of the keywords to be extracted for each sentence.

The text data may be a text message, and the extracting of the keywords may include: training the keyword extraction model by passing the sequence of sentences of the text message through an embedding layer generated by the Word2Vec embedding; outputting, as a result, at least one keyword among place, time, and event; and generating a sentence by combining the output keywords.

The extracting of the keywords may include constructing a deep learning-based keyword extraction model for named entity recognition, and the keyword extraction model may be any one of an RNN, an LSTM (Long Short-Term Memory), and a GRU (Gated Recurrent Unit).

A computer-implemented keyword extraction device may include: an embedding unit that performs embedding for natural language processing on text data; and a keyword extraction unit that extracts keywords by training a deep learning-based keyword extraction model, configured for named entity recognition (NER), using the embedded text data.

Training a keyword extraction model to which named entity recognition is applied on text messages can improve the accuracy of keyword extraction.

FIG. 1 is a diagram for explaining an operation of performing embedding in a keyword extraction device according to an embodiment.
FIG. 2 is a diagram for explaining an operation of training a keyword extraction model in a keyword extraction device according to an embodiment.
FIG. 3 is a diagram for explaining a deep learning-based keyword extraction structure in a keyword extraction device according to an embodiment.
FIG. 4 is a block diagram for explaining the configuration of a keyword extraction device according to an embodiment.
FIG. 5 is a flowchart for explaining a keyword extraction method in a keyword extraction device according to an embodiment.

Hereinafter, embodiments are described in detail with reference to the accompanying drawings.

The embodiments describe a method and apparatus that embed a text message to convert its words or sentences into vectors, and then extract and store the keywords of the text message using named entity recognition (NER) with deep learning applied.

FIG. 1 is a diagram for explaining an operation of performing embedding in a keyword extraction device according to an embodiment.

To process natural language, the keyword extraction device may convert it into a computable form. Through embedding, the device may convert natural language into vectors. A corpus contains condensed semantic and grammatical information. Because embedding yields vectorized data, basic arithmetic operations become possible, and word or document relevance can also be calculated. In the embodiment, Word2Vec, one of several embedding techniques, may be used to embed the words of text data (for example, a text message) before NER is applied.

Word2Vec offers two methods: CBOW (Continuous Bag of Words) and Skip-Gram. Taking one word in a text message as the reference, the CBOW method uses several words surrounding it to predict the word in the middle. Here, the text message consists of text data, which may include letters, symbols, numbers, emoticons, and the like, and the message may consist of one or more sentences or one or more words. The Skip-Gram method does the reverse: it uses the word in the middle, chosen by a predetermined criterion, to predict the surrounding words. In the embodiment, the CBOW method may be applied so that the middle word is predicted from the surrounding words and vectorized. First, the keyword extraction device may determine how many words before and/or after the reference position to check in order to predict the center word, and may then create a window of that size and generate training data while shifting the center word and the surrounding words. The center word may be a word located in the middle of a sentence, or may mean a key word needed to compose the sentence.

With reference to FIG. 1, embedding of the text message (sentence) '복지관에서 6시에 잔치가 열립니다.' ('A party will be held at the welfare center at 6 o'clock.') is described. For example, the keyword extraction device may decide to look at one word before and one word after each position in the text message, or, in the same manner, to look at two words after each position. The keyword extraction device may create a window according to the determined number of words to check and generate training data while shifting the center word and the surrounding words.
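As a minimal sketch of the windowing step described above, the following Python function builds (surrounding words, center word) training pairs. The function name `cbow_pairs`, its `before` and `after` parameters, and the pre-split token list are assumptions made for illustration; the patent does not specify an implementation.

```python
from typing import List, Tuple


def cbow_pairs(tokens: List[str], before: int = 1, after: int = 1) -> List[Tuple[List[str], str]]:
    """Build (surrounding words, center word) pairs by sliding a window over tokens."""
    pairs = []
    for i, center in enumerate(tokens):
        left = tokens[max(0, i - before):i]   # up to `before` words in front of the center
        right = tokens[i + 1:i + 1 + after]   # up to `after` words behind the center
        context = left + right
        if context:                           # skip positions that have no neighbors
            pairs.append((context, center))
    return pairs


# Example sentence from FIG. 1, already split into words and postpositions (assumed split).
tokens = ["복지관", "에서", "6시", "에", "잔치", "가", "열립니다"]
one_each_side = cbow_pairs(tokens, before=1, after=1)  # one word before and one after
two_behind = cbow_pairs(tokens, before=0, after=2)     # two words after each position
```

Each pair would then serve as one training example in which the surrounding words are the input and the center word is the prediction target.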

FIG. 1 shows an example of the operation of Word2Vec. For example, when a text message is input to Word2Vec, words may be separated on the basis of pre-stored Korean postpositions (particles) and spacing, and grouped according to the window size. Information on Korean postpositions and spacing may be stored in advance, or the device may be linked to an external service that provides such information.
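The separation step could look roughly like the sketch below, which splits on spaces and then peels a trailing postposition off each chunk. The `POSTPOSITIONS` tuple is a small hypothetical stand-in for the pre-stored postposition information, and plain suffix matching is an assumed simplification: it would wrongly split ordinary words that merely end in one of these syllables, which a real morphological analyzer would avoid.

```python
from typing import List

# Hypothetical, deliberately small list of pre-stored Korean postpositions.
POSTPOSITIONS = ("에서", "에", "가", "이", "은", "는", "을", "를")


def split_words(message: str) -> List[str]:
    """Split a message on spaces, then separate a trailing postposition from each chunk."""
    tokens = []
    # Try longer postpositions first so that "에서" is matched before "에".
    candidates = sorted(POSTPOSITIONS, key=len, reverse=True)
    for chunk in message.strip().rstrip(".").split():
        for particle in candidates:
            if chunk.endswith(particle) and len(chunk) > len(particle):
                tokens.extend([chunk[:-len(particle)], particle])
                break
        else:
            tokens.append(chunk)  # no postposition found; keep the chunk whole
    return tokens


tokens = split_words("복지관에서 6시에 잔치가 열립니다.")
```

The resulting token list is the kind of input that would then be grouped into windows.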

Each window is one-hot encoded and fed into a projection layer. After a lookup-table operation in the projection layer, the similarity value of the word in the middle of each window may be output. For example, '에서' and '에' attach to a place and to a time, respectively, but both are postpositions and may therefore have similar similarity values.
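The lookup-table behavior of the projection layer can be illustrated with a small sketch: multiplying a one-hot vector by the projection weights simply selects one row of the weight matrix. The vocabulary, the two-dimensional weights, and the use of cosine similarity as the similarity measure are all assumptions chosen for illustration; the weights are made-up numbers, not trained values.

```python
import math
from typing import List


def one_hot(index: int, size: int) -> List[float]:
    """Return a vector of zeros with a single 1.0 at `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def project(vec: List[float], weights: List[List[float]]) -> List[float]:
    """Multiply a 1 x V one-hot vector by a V x D weight matrix."""
    dim = len(weights[0])
    return [sum(vec[v] * weights[v][d] for v in range(len(weights))) for d in range(dim)]


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


vocab = ["복지관", "에서", "6시", "에"]
# Made-up projection weights (V = 4 words, D = 2 dimensions), not trained values.
weights = [[0.9, 0.1], [0.2, 0.8], [0.7, 0.3], [0.25, 0.75]]

vec_eseo = project(one_hot(vocab.index("에서"), len(vocab)), weights)
vec_e = project(one_hot(vocab.index("에"), len(vocab)), weights)
particle_similarity = cosine(vec_eseo, vec_e)
noun_similarity = cosine(vec_eseo, weights[vocab.index("복지관")])
```

Because the product only ever picks out a row, implementations usually skip the multiplication and index the table directly.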

FIG. 2 is a diagram for explaining an operation of training a keyword extraction model in a keyword extraction device according to an embodiment.

Named entity recognition is an algorithm that uses natural language processing to grasp contextual meaning and extract entities. If an entity is defined as the metadata the user wants to find, the items the user wants can be output from a sentence. Specifically, named entity recognition is a technique that recognizes, extracts, and classifies words (named entities) in a document that correspond to predefined categories such as person, company, place, time, and unit. Extracted entities may be classified as person, location, organization, time, and so on. Named entity recognition originated in information extraction and is now used in natural language processing, information retrieval, and other areas.

In the embodiment, to increase the accuracy of extracting information (for example, keywords) through named entity recognition, keywords may be extracted from a text message consisting of text data by using named entity recognition to which a deep learning model is applied. The text message may be sent to and received from another user through an electronic device, and keywords may be extracted from the messages exchanged. When presenting the extracted keywords, the keyword extraction device may, for example, enlarge them for the benefit of elderly users. As one example, when the keyword extraction device runs as an application that provides a keyword extraction service, it may enlarge the keywords to suit the user's age on the basis of the identification information the elderly user entered in order to use the service. Alternatively, if an elderly user finds the displayed keywords too small, the keyword extraction device may provide a user interface for enlarging or reducing the keyword size, and the size may be adjusted according to the user's input through that interface. Alternatively, the keyword extraction device may present the extracted keywords in a size larger than that of the text message.
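One way to picture the age-based enlargement is the sketch below. The age thresholds, scale factors, and base size are purely hypothetical; the description only states that the keyword size may be enlarged to suit the user's age.

```python
def keyword_font_size(age: int, base_size: int = 16) -> int:
    """Return an enlarged keyword font size for older users (thresholds are hypothetical)."""
    if age >= 80:
        scale = 2.0
    elif age >= 70:
        scale = 1.75
    elif age >= 60:
        scale = 1.5
    else:
        scale = 1.0  # no enlargement below the first threshold
    return round(base_size * scale)


size_for_65 = keyword_font_size(65)
size_for_85 = keyword_font_size(85)
```

A manual zoom control, as also described above, could simply override the value this function returns.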

The keyword extraction device may extract keywords by training a deep learning-based keyword extraction model, configured for named entity recognition, using the embedded text message (embedded vectors). The deep learning-based keyword extraction model may have been trained in advance on a training data set.

FIG. 2 shows an example of a keyword extraction model trained on inputs to which Word2Vec has been applied. A sentence may be input to the keyword extraction model. In other words, among the similarity values output by Word2Vec in FIG. 1, those of the words other than postpositions may be input to the keyword extraction model. The model may be trained using this similarity-converted data as input. The keyword extraction model may be a deep learning-based model for natural language processing. As a result of training, labeling values of the keywords to be extracted may be output for each sentence. A sentence may also be output as a training result.

As another example, the sequence of sentences of a text message may pass through an embedding layer generated by the Word2Vec embedding and be input to the keyword extraction model. As a result of training on the sequence that has passed through the embedding layer, at least one keyword among place, time, and event may be output. A sentence may then be generated by combining the output keywords.
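The final combination step might be sketched as follows. The dictionary keys, the fixed place-time-event order, and simple space joining are assumptions; the description does not say how the keywords are ordered or joined.

```python
from typing import Dict

# Assumed presentation order for the three keyword types named in the description.
KEYWORD_ORDER = ("place", "time", "event")


def compose_summary(keywords: Dict[str, str]) -> str:
    """Join whichever of the place, time, and event keywords are present into one line."""
    return " ".join(keywords[k] for k in KEYWORD_ORDER if keywords.get(k))


full = compose_summary({"place": "복지관", "time": "6시", "event": "잔치"})
partial = compose_summary({"time": "6시", "event": "잔치"})
```

Missing keyword types are skipped, which matches the "at least one keyword" wording above.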

In deep learning for natural language processing, the RNN (Recurrent Neural Network) family is commonly preferred. RNN-family models include the RNN, LSTM (Long Short-Term Memory), and GRU (Gated Recurrent Unit).

Because a plain RNN tends to lose long-range information, LSTM is preferred over it. GRU, a modification of LSTM, is advantageous in speed, but LSTM is sometimes slightly ahead in performance. In the embodiment, LSTM may be used because accuracy matters somewhat more for elderly users.

The layer configuration of the keyword extraction model is closely tied to its accuracy and may therefore be determined through experiments, or it may be set by the user. RNN-family algorithms start by learning in the forward direction, from the front of a sentence to the back. By adding the backward direction, from the back of the sentence to the front, a BiLSTM (bidirectional LSTM) that considers both directions may be used. This is because bidirectional LSTM models perform better.
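The bidirectional idea can be shown with a toy sketch that runs a recurrent cell over the sequence front-to-back and back-to-front and pairs the two states at each position. Note the substitution: this uses a simple scalar tanh RNN cell, not an LSTM, and the weights `w_x` and `w_h` are arbitrary assumed values; it is only meant to show how each position gains access to context on both sides.

```python
import math
from typing import List, Tuple


def run_rnn(xs: List[float], w_x: float = 0.5, w_h: float = 0.5) -> List[float]:
    """Toy scalar recurrent cell: h_t = tanh(w_x * x_t + w_h * h_{t-1})."""
    h, states = 0.0, []
    for x in xs:
        h = math.tanh(w_x * x + w_h * h)
        states.append(h)
    return states


def bidirectional(xs: List[float]) -> List[Tuple[float, float]]:
    """Pair the forward state and the backward state at every position."""
    forward = run_rnn(xs)
    backward = run_rnn(xs[::-1])[::-1]  # process back-to-front, then realign positions
    return list(zip(forward, backward))


a = bidirectional([1.0, 2.0, 3.0])
b = bidirectional([1.0, 2.0, 9.0])  # same sequence except for the last element
```

Changing only the last element leaves the forward state at the first position untouched but changes its backward state, which is the extra context a forward-only model would not see.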

FIG. 3 is a diagram for explaining a deep learning-based keyword extraction structure in a keyword extraction device according to an embodiment.

FIG. 3 shows the overall structure of the deep learning-based keyword extraction system. As described above, it depicts a model that uses BiLSTM with a plurality of (for example, two) stacked layers. As explained for FIG. 2, the vectors embedded by Word2Vec in FIG. 1 are the input in FIG. 3. The input passes through the BiLSTM layers, and the required keywords may be output. Words that are not required keywords are trained with the output -1 (None) so that they are not extracted.
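Reading keywords off per-word labels might look like the sketch below. Only the -1 (None) convention comes from the description; the label ids 0, 1, and 2, the `LABEL_NAMES` mapping, and the example label sequence are hypothetical, and no actual model is run here.

```python
from typing import Dict, List

NONE_LABEL = -1  # words that are not keywords, per the description
# Hypothetical label ids for the three keyword types.
LABEL_NAMES = {0: "place", 1: "time", 2: "event"}


def decode_keywords(tokens: List[str], labels: List[int]) -> Dict[str, str]:
    """Keep the first word found for each keyword type and drop words labeled None."""
    keywords: Dict[str, str] = {}
    for token, label in zip(tokens, labels):
        if label == NONE_LABEL:
            continue  # not a required keyword, so it is not extracted
        keywords.setdefault(LABEL_NAMES[label], token)
    return keywords


# Words of the FIG. 1 sentence with postpositions removed, plus assumed model labels.
tokens = ["복지관", "6시", "잔치", "열립니다"]
labels = [0, 1, 2, -1]
keywords = decode_keywords(tokens, labels)
```

Here the verb '열립니다' carries the None label and is therefore left out of the result.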

FIG. 4 is a block diagram for explaining the configuration of a keyword extraction device according to an embodiment, and FIG. 5 is a flowchart for explaining a keyword extraction method in the keyword extraction device according to an embodiment.

The processor included in the keyword extraction device 100 may include an embedding unit 410 and a keyword extraction unit 420. The processor and its components may control the keyword extraction device to perform steps 510 to 520 of the keyword extraction method of FIG. 5. The processor and its components may be implemented to execute instructions according to the code of an operating system and the code of at least one program held in memory. Here, the components of the processor may be representations of the different functions that the processor performs according to control commands provided by program code stored in the keyword extraction device 100.

The processor may load into memory the program code stored in a file of a program for the keyword extraction method. For example, when the program is executed on the keyword extraction device 100, the processor may, under the control of the operating system, control the keyword extraction device to load the program code from the program file into memory.

In step 510, the embedding unit 410 may perform embedding for natural language processing on text data. The text data may be a text message. The embedding unit 410 may divide the text message into words or sentences and vectorize the divided words or sentences. The embedding unit 410 may perform embedding using Word2Vec, which includes the CBOW (Continuous Bag of Words) method and the Skip-Gram method. By applying CBOW embedding to the text message, the embedding unit 410 may predict center words from the words surrounding each word in the text message and vectorize the predicted center words.

In step 520, the keyword extraction unit 420 may extract keywords by training a deep learning-based keyword extraction model, configured for named entity recognition, using the embedded text data. The keyword extraction unit 420 may train the keyword extraction model using, as input, the data converted into similarity values by the Word2Vec embedding, and may output, as a result of the training, labeling values of the keywords to be extracted for each sentence. The keyword extraction unit 420 may construct the deep learning-based keyword extraction model for named entity recognition. The keyword extraction unit 420 may also train the keyword extraction model by passing the sequence of sentences of the text message through an embedding layer generated by the Word2Vec embedding, output at least one keyword among place, time, and event as a result, and generate a sentence by combining the output keywords.

The device described above may be implemented as hardware components, software components, and/or a combination of hardware and software components. For example, the devices and components described in the embodiments may be implemented using one or more general-purpose or special-purpose computers, such as a processor, a controller, an arithmetic logic unit (ALU), a digital signal processor, a microcomputer, a field programmable gate array (FPGA), a programmable logic unit (PLU), a microprocessor, or any other device capable of executing and responding to instructions. A processing device may run an operating system (OS) and one or more software applications that run on the operating system. A processing device may also access, store, manipulate, process, and generate data in response to the execution of software. For ease of understanding, a single processing device is sometimes described, but those skilled in the art will appreciate that a processing device may include a plurality of processing elements and/or multiple types of processing elements. For example, a processing device may include a plurality of processors, or one processor and one controller. Other processing configurations, such as parallel processors, are also possible.

Software may include a computer program, code, instructions, or a combination of one or more of these, and may configure a processing device to operate as desired or may instruct a processing device independently or collectively. Software and/or data may be embodied in any type of machine, component, physical device, virtual equipment, computer storage medium, or device in order to be interpreted by a processing device or to provide instructions or data to a processing device. Software may be distributed over networked computer systems and stored or executed in a distributed manner. Software and data may be stored on one or more computer-readable recording media.

The method according to the embodiment may be implemented in the form of program instructions executable by various computer means and recorded on a computer-readable medium. The computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the medium may be specially designed and configured for the embodiment, or may be known and available to those skilled in computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include machine code, such as that produced by a compiler, as well as high-level language code that a computer can execute using an interpreter or the like.

Although the embodiments have been described with reference to a limited set of examples and drawings, those skilled in the art can make various modifications and variations from the above description. For example, appropriate results may be achieved even if the described techniques are performed in a different order from the described method, and/or the components of the described systems, structures, devices, circuits, and the like are coupled or combined in a different form from the described method, or are replaced or substituted by other components or equivalents.

Therefore, other implementations, other embodiments, and equivalents of the claims also fall within the scope of the claims that follow.

Claims

A keyword extraction method performed by a computer-implemented keyword extraction device, the method comprising:
performing embedding for natural language processing on text data using Word2Vec; and
extracting keywords by training a deep learning-based keyword extraction model configured for Named Entity Recognition using the embedded text data,
wherein performing the embedding comprises:
when the text data is input to Word2Vec as input data, separating the text data into words based on pre-stored Korean postpositions and spaces; generating windows over the separated words according to a predetermined number of words to check; generating training data while changing the center word and the surrounding words among the separated words; one-hot encoding each generated window and inputting it to a projection layer; and, after a lookup table is computed by passing through the projection layer, outputting a similarity value of the center word of each window as an output value,
wherein extracting the keywords comprises:
receiving as input values, from among the similarity values of the words output by the embedding performed using Word2Vec, the values of words excluding postpositions; training the keyword extraction model using the data converted to the similarity values as input; and outputting, as a training result of the keyword extraction model, a labeling value of the keyword to be extracted for each sentence value,
and wherein the keyword extraction method further comprises:
enlarging the size of the keywords extracted through the keyword extraction model according to the age of an elderly user, based on identification information of the elderly user, and displaying the extracted keywords on a display at the enlarged size.
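
The window generation, one-hot encoding, and projection-layer lookup recited in this claim can be illustrated with a minimal Python sketch. It is not the patented implementation: the function names, the English sample words, the window size of 1, and the 4x2 projection weights are all hypothetical values chosen for illustration. The sketch only shows that multiplying a one-hot vector by the projection matrix is equivalent to looking up one row of that matrix.

```python
def make_windows(words, window_size):
    """Return (center, context) pairs, moving the center word across the sentence."""
    pairs = []
    for i, center in enumerate(words):
        lo = max(0, i - window_size)
        hi = min(len(words), i + window_size + 1)
        context = [words[j] for j in range(lo, hi) if j != i]
        pairs.append((center, context))
    return pairs


def one_hot(index, size):
    """Return a vector of zeros with a single 1.0 at the given index."""
    vec = [0.0] * size
    vec[index] = 1.0
    return vec


def project(one_hot_vec, weights):
    """Multiply a one-hot row vector by the projection matrix (a table lookup)."""
    dim = len(weights[0])
    return [
        sum(one_hot_vec[r] * weights[r][c] for r in range(len(weights)))
        for c in range(dim)
    ]


# Hypothetical sample: words already separated, postpositions removed.
words = ["hospital", "appointment", "monday", "morning"]
vocab = {w: i for i, w in enumerate(words)}

# Hypothetical, untrained 4x2 projection weights (one row per vocabulary word).
weights = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]

pairs = make_windows(words, window_size=1)
vec = project(one_hot(vocab["monday"], len(vocab)), weights)
```

Here `pairs` holds the training pairs produced by sliding the center word, and `vec` equals the row of `weights` belonging to "monday", which is why the projection step is usually described as a lookup table.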

The keyword extraction method of claim 1,
wherein the text data is a text message, and
wherein performing the embedding comprises:
dividing the text message into words or sentences as the embedding is performed, and vectorizing the divided words or sentences.
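
As a rough illustration of dividing a text message into words on spaces and Korean postpositions, the following Python sketch strips a trailing postposition from each space-separated chunk. The postposition list is a small hypothetical sample, not the pre-stored list referred to in the claims, and naive suffix matching of this kind can mis-split nouns that happen to end in one of these syllables; a real system would more likely rely on a morphological analyzer.

```python
# Hypothetical, incomplete sample of Korean postpositions (assumption).
POSTPOSITIONS = ("에서", "은", "는", "이", "가", "을", "를", "에")


def split_words(message):
    """Split on spaces, then separate one trailing postposition per chunk."""
    tokens = []
    for chunk in message.split():
        # Try longer postpositions first so "에서" wins over "에".
        for p in sorted(POSTPOSITIONS, key=len, reverse=True):
            if chunk.endswith(p) and len(chunk) > len(p):
                tokens.append((chunk[:-len(p)], p))
                break
        else:
            tokens.append((chunk, ""))
    return tokens


tokens = split_words("병원에서 월요일에 진료를 받습니다")
content_words = [stem for stem, _ in tokens]
```

The `content_words` list keeps only the stems, which corresponds to feeding words without postpositions to the later steps.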

The keyword extraction method of claim 2,
wherein performing the embedding comprises:
performing the embedding using Word2Vec, which includes a CBOW (Continuous Bag of Words) method and a Skip-Gram method.

The keyword extraction method of claim 2,
wherein performing the embedding comprises:
predicting center words from the words surrounding each word in the text message as the text message is embedded using the Continuous Bag of Words (CBOW) method, and vectorizing the predicted center words.
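
The CBOW prediction step can be sketched as follows: the vectors of the surrounding words are averaged, the average is scored against an output vector for every vocabulary word, and a softmax turns the scores into probabilities for the center word. This is a minimal forward pass only, with no training loop; the vocabulary, the two-dimensional vectors, and the function name `cbow_predict` are hypothetical and hand-picked so the result is easy to follow.

```python
import math


def cbow_predict(context, vocab, in_vecs, out_vecs):
    """Predict the center word from its context words (forward pass only)."""
    dim = len(in_vecs[0])
    # Hidden layer: average of the context words' input vectors.
    hidden = [
        sum(in_vecs[vocab[w]][d] for w in context) / len(context)
        for d in range(dim)
    ]
    # Score every vocabulary word by dot product with its output vector.
    scores = [sum(h * o for h, o in zip(hidden, out)) for out in out_vecs]
    # Numerically stable softmax.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    words = sorted(vocab, key=vocab.get)
    return words[probs.index(max(probs))], probs


# Hypothetical vocabulary and hand-picked (untrained) vectors.
vocab = {"clinic": 0, "visit": 1, "monday": 2, "nine": 3}
in_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
out_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]

predicted, probs = cbow_predict(["clinic", "visit"], vocab, in_vecs, out_vecs)
```

With these hand-picked vectors the averaged context is `[0.5, 0.5]`, which scores highest against the output vector of "monday", so `predicted` is "monday". In actual training the vectors would be adjusted so that such predictions match the observed center words.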

The keyword extraction method of claim 1,
wherein the text data is a text message, and
wherein extracting the keywords comprises:
training the keyword extraction model by passing the sequence of sentences of the text message through an embedding layer generated by the embedding performed using Word2Vec; outputting, as a result of training the keyword extraction model, at least one keyword from among a place, a time, and an event; and generating a sentence by combining the at least one output keyword.
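
The final step of this claim, combining the output keywords into a sentence, could be sketched as below. The label names `PLACE`, `TIME`, `EVENT`, and `O`, the English sample message, and the comma-separated output format are assumptions made for illustration; the claims do not specify how the combined sentence is worded.

```python
def group_keywords(tagged_tokens):
    """Collect the words labeled as time, place, or event; ignore the rest."""
    groups = {"TIME": [], "PLACE": [], "EVENT": []}
    for word, label in tagged_tokens:
        if label in groups:
            groups[label].append(word)
    return groups


def build_sentence(groups):
    """Join the non-empty keyword groups in a fixed time, place, event order."""
    parts = []
    for label in ("TIME", "PLACE", "EVENT"):
        if groups[label]:
            parts.append(" ".join(groups[label]))
    return ", ".join(parts)


# Hypothetical model output: one label per word of the message.
tagged = [
    ("Please", "O"), ("visit", "O"),
    ("Inha", "PLACE"), ("Hospital", "PLACE"),
    ("Monday", "TIME"), ("10am", "TIME"),
    ("for", "O"), ("checkup", "EVENT"),
]
summary = build_sentence(group_keywords(tagged))
```

For the sample above, `summary` is "Monday 10am, Inha Hospital, checkup".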

(Deleted)

The keyword extraction method of claim 1 or claim 5,
wherein extracting the keywords comprises:
constructing a deep learning-based keyword extraction model for Named Entity Recognition,
and wherein the keyword extraction model is composed of any one of a recurrent neural network (RNN), Long Short-Term Memory (LSTM), and a Gated Recurrent Unit (GRU).
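
To show how a recurrent model assigns one label per word, the following Python sketch runs the forward pass of a plain (Elman) RNN, the simplest of the three options named in this claim. LSTM and GRU cells would replace the single `tanh` update with gated updates. The weights are hand-picked rather than trained, biases are omitted, and the label set and the function name `rnn_tag` are hypothetical.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def rnn_tag(inputs, w_x, w_h, w_out, labels):
    """Label each input vector using a plain RNN forward pass (no bias terms)."""
    hidden = [0.0] * len(w_h)
    tags = []
    for x in inputs:
        # h_t = tanh(W_x x_t + W_h h_{t-1})
        pre = [a + b for a, b in zip(matvec(w_x, x), matvec(w_h, hidden))]
        hidden = [math.tanh(p) for p in pre]
        # Pick the label with the highest output score.
        scores = matvec(w_out, hidden)
        tags.append(labels[scores.index(max(scores))])
    return tags


# Hypothetical hand-picked weights; a trained model would learn these.
w_x = [[1.0, 0.0], [0.0, 1.0]]
w_h = [[0.1, 0.0], [0.0, 0.1]]
w_out = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
labels = ["PLACE", "TIME", "O"]

# Hypothetical two-dimensional word vectors for a three-word message.
inputs = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
tags = rnn_tag(inputs, w_x, w_h, w_out, labels)
```

The hidden state carried from word to word is what lets the label of a word depend on the words before it.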

A computer-implemented keyword extraction device comprising:
an embedding unit that performs embedding for natural language processing on text data using Word2Vec; and
a keyword extraction unit that extracts keywords by training a deep learning-based keyword extraction model configured for Named Entity Recognition using the embedded text data,
wherein the embedding unit,
when the text data is input to Word2Vec as input data, separates the text data into words based on pre-stored Korean postpositions and spaces; generates windows over the separated words according to a predetermined number of words to check; generates training data while changing the center word and the surrounding words among the separated words; one-hot encodes each generated window and inputs it to a projection layer; and, after a lookup table is computed by passing through the projection layer, outputs a similarity value of the center word of each window as an output value,
wherein the keyword extraction unit
receives as input values, from among the similarity values of the words output by the embedding performed using Word2Vec, the values of words excluding postpositions; trains the keyword extraction model using the data converted to the similarity values as input; and outputs, as a training result of the keyword extraction model, a labeling value of the keyword to be extracted for each sentence value,
and wherein the keyword extraction device
enlarges the size of the keywords extracted through the keyword extraction model according to the age of an elderly user, based on identification information of the elderly user, and displays the extracted keywords on a display at the enlarged size.
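
The age-dependent enlargement of displayed keywords could be realized by a simple mapping from age to font size, sketched below in Python. The claims do not state any concrete sizes or age thresholds, so the base size, step, starting age, and upper limit here are purely hypothetical defaults, as is the function name.

```python
def keyword_font_size(age, base_pt=16, step_pt=4, start_age=65, max_pt=40):
    """Return a font size in points that grows with age (hypothetical rule).

    Below start_age the base size is used. From start_age on, the size
    grows by step_pt for each started decade, capped at max_pt.
    """
    if age < start_age:
        return base_pt
    decades = (age - start_age) // 10 + 1
    return min(base_pt + step_pt * decades, max_pt)


# Hypothetical usage: the age would be derived from the user's identification information.
sizes = {age: keyword_font_size(age) for age in (60, 65, 75, 90)}
```

Under these assumed defaults, a 60-year-old sees keywords at 16 pt, a 65-year-old at 20 pt, a 75-year-old at 24 pt, and a 90-year-old at 28 pt.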