KR102150560B1

KR102150560B1 - Apparatus and method for analyzing target using topic

Info

Publication number: KR102150560B1
Application number: KR1020180023009A
Authority: KR
Inventors: 이석준; 조윤재; 윤재웅; 전재헌; 송현정; 전종수
Original assignee: 광운대학교 산학협력단
Priority date: 2018-02-26
Filing date: 2018-02-26
Publication date: 2020-09-01
Also published as: KR20190102529A

Abstract

The present invention proposes a target analysis apparatus and method for analyzing targets related to an event through a community. The apparatus according to the present invention includes: a message processing unit that collects and preprocesses text messages related to an event; a keyword detection unit that detects topics as keywords from documents related to the text messages, based on weights assigned to the meaning-bearing terms contained in the text messages; and a target derivation unit that derives targets by age and by year by performing social network analysis on the topics.

Description

Target analysis device and method using topic {Apparatus and method for analyzing target using topic}

The present invention relates to an apparatus and method for analyzing a target. More specifically, it relates to an apparatus and method for analyzing a target using topics.

Stress is a non-specific biological response of the body to various injuries and stimuli inflicted on a living organism. It refers to the feelings of anxiety and threat that a person experiences in a situation that is psychologically or physically difficult to cope with.

If such stress accumulates in the body without being relieved, it can cause fundamental damage to a person's health, exposing the person to various diseases such as heart disease, obesity, diabetes, and cancer, and in some cases leading to death, including by suicide.

Korean Patent Publication No. 2017-0115037 (Publication date: 2017.10.16.)

The present invention has been devised to solve the above problem. Its object is to propose a target analysis apparatus and method that detects topics, based on weights, in text messages collected through a community, and performs social network analysis (SNA) on those topics to derive targets such as stressors.

However, the objects of the present invention are not limited to those mentioned above, and other objects not mentioned will be clearly understood by those skilled in the art from the following description.

To achieve the above object, the present invention proposes a target analysis apparatus comprising: a message processing unit that collects and preprocesses text messages related to an event; a keyword detection unit that detects topics as keywords from documents related to the text messages, based on weights assigned to the meaning-bearing terms contained in the text messages; and a target derivation unit that derives targets by age by performing social network analysis on the topics.

The present invention also proposes a target analysis method comprising the steps of: collecting and preprocessing text messages related to an event; detecting topics as keywords from documents related to the text messages, based on weights assigned to the meaning-bearing terms contained in the text messages; and deriving targets by age by performing social network analysis on the topics.

The present invention further proposes a computer program stored in a computer-readable medium for executing the target analysis method on a computer.

Through the configurations for achieving the above object, the present invention can obtain the following effects.

First, targets can be derived effectively by age and by year.

Second, measures for solving social problems can be proposed by analyzing the correlations between targets.

FIG. 1 is a conceptual diagram schematically showing a target analysis apparatus according to an embodiment of the present invention.
FIG. 2 is a block diagram schematically showing the internal configuration of the data collection and preprocessing unit of the target analysis apparatus.
FIG. 3 is a reference diagram for explaining the functions of the data collection and preprocessing unit provided in the target analysis apparatus.
FIGS. 4 to 7 are reference diagrams for explaining the functions of the topic extraction unit provided in the target analysis apparatus.
FIGS. 8 to 12 are reference diagrams for explaining the functions of the topic-based network analysis unit provided in the target analysis apparatus.
FIG. 13 is a conceptual diagram schematically showing the internal configuration of a target analysis apparatus according to a preferred embodiment of the present invention.

Hereinafter, preferred embodiments of the present invention are described in detail with reference to the accompanying drawings. In assigning reference numerals to the elements of each drawing, the same elements are given the same numerals wherever possible, even when they appear in different drawings. In describing the present invention, detailed descriptions of related known configurations or functions are omitted where they could obscure the subject matter of the invention. Although preferred embodiments are described below, the technical idea of the present invention is not limited to them and may be modified and implemented in various ways by those skilled in the art.

The present invention relates to an apparatus that uses topics to analyze targets related to an event; in one embodiment, it relates to an apparatus that analyzes, by age, the stressors related to life events. To analyze targets, the present invention uses social network analysis (SNA), for example topic-based social network analysis (topic-based SNA). The present invention is described in detail below with reference to the drawings.

FIG. 1 is a conceptual diagram schematically showing a target analysis apparatus according to an embodiment of the present invention.

Referring to FIG. 1, the target analysis apparatus 100 includes a data collection and preprocessing unit 110, a topic extraction unit 120, and a topic-based network analysis unit 130.

The data collection and preprocessing unit 110 collects text messages and preprocesses them. For this purpose, as shown in FIG. 2, the data collection and preprocessing unit 110 may include a message collection unit 111, a non-standard language removal unit 112, a text message classification unit 113, and a VSM (Vector Space Modeling) execution unit 114. FIG. 2 is a block diagram schematically showing the internal configuration of the data collection and preprocessing unit of the target analysis apparatus.

The message collection unit 111 collects text messages from online communities such as SNS (Social Network Services), cafes, and blogs. For example, to analyze the stressors of women related to life events, the message collection unit 111 may collect text messages from an online community frequently used by women. The message collection unit 111 stores the collected messages in a memory provided in advance.

The non-standard language removal unit 112 removes non-standard language from each text message collected by the message collection unit 111, including meaningless characters, unimportant words, prefixes, and suffixes. Through this function, the data collection and preprocessing unit 110 can obtain a list of words suitable for representing the content of each message.
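The removal step can be sketched as follows. This is a minimal illustration for English text, not the apparatus's actual procedure: the stopword list, the URL pattern, and the function name `remove_nonstandard` are assumptions introduced here, and prefix and suffix stripping is left out.

```python
import re

# Hypothetical list of unimportant words; a real system would use a fuller list.
STOPWORDS = {"the", "a", "is", "so", "and"}

def remove_nonstandard(message: str) -> list[str]:
    """Return the words of a message after dropping noise and unimportant words."""
    text = re.sub(r"https?://\S+", " ", message.lower())  # drop URLs
    text = re.sub(r"(.)\1{2,}", r"\1", text)              # collapse runs such as "soooo"
    words = re.findall(r"[a-z]+", text)                   # keep alphabetic words only
    return [w for w in words if w not in STOPWORDS]

print(remove_nonstandard("Work is soooo hard!!! http://x.y"))
```

The output is the per-message word list that the later tokenization and weighting steps operate on.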

Once the non-standard language removal unit 112 has removed non-standard language from a text message, the text message classification unit 113 performs tokenization, which splits the sentences in the text message into terms or morphemes. The text message classification unit 113 may use a POS (Part-Of-Speech) tagging technique to perform tokenization.

POS tagging identifies the part of speech of each word in a sentence and attaches a tag to it. Using POS tagging, the text message classification unit 113 can classify sentences into tokenized terms such as nouns, verbs, and adjectives, and can output the tokenized terms in the form of tuples.
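The tuple output described here can be illustrated with a toy tagger. The sketch below assumes a hypothetical four-word lexicon (`LEXICON`) in place of a trained POS tagger, so it only shows the shape of the data, not a realistic tagging method.

```python
import re

# Hypothetical mini-lexicon; a real system would use a trained POS tagger.
LEXICON = {"work": "NOUN", "stress": "NOUN", "feel": "VERB", "tired": "ADJ"}

def tag_tokens(sentence: str) -> list[tuple[str, str]]:
    """Split a sentence into tokens and return (token, tag) tuples."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    return [(tok, LEXICON.get(tok, "UNK")) for tok in tokens]

def content_terms(tagged, keep=("NOUN", "VERB", "ADJ")):
    """Keep only the tokens whose tag is a noun, verb, or adjective."""
    return [tok for tok, tag in tagged if tag in keep]

tagged = tag_tokens("I feel tired!")
print(tagged)
print(content_terms(tagged))
```

Filtering on the tag is one plausible way to arrive at the nouns, verbs, and adjectives that are passed on as tokenized terms.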

The VSM execution unit 114 uses vector space modeling (VSM) to convert the terms tokenized by the text message classification unit 113 into normalized weighted terms. Vector space modeling is an information retrieval technique that represents terms and messages as vectors.

When converting the tokenized terms into normalized weighted terms, the VSM execution unit 114 can represent the tokenized terms and the text messages that contained them in matrix form, as shown in FIG. 3. FIG. 3 is a reference diagram for explaining the function of the VSM execution unit 114.

FIG. 3 is a schematic presentation of the VSM. As shown in FIG. 3, the VSM execution unit 114 can build a matrix X_t,d in which the m tokenized terms (m-terms) are placed in the rows and the n text messages (n-documents) are placed in the columns. This allows the VSM execution unit 114 to calculate the frequency of the m tokenized terms (Term 1, Term 2, Term 3, …, Term M) in the n text messages (Document 1, Document 2, Document 3, …, Document N).
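A term-by-document matrix of this shape can be built directly from tokenized messages. The sketch below is illustrative only; the function name `term_document_matrix` and the sample messages are assumptions.

```python
def term_document_matrix(docs: list[list[str]]):
    """Return (terms, matrix) where matrix[i][j] counts term i in document j."""
    terms = sorted({t for doc in docs for t in doc})          # rows: distinct terms
    matrix = [[doc.count(t) for doc in docs] for t in terms]  # columns: documents
    return terms, matrix

docs = [["work", "stress", "work"], ["child", "stress"]]
terms, X = term_document_matrix(docs)
for term, row in zip(terms, X):
    print(term, row)
```

Each row then gives the frequency of one term across all messages, and each column describes one message.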

Using vector space modeling, the VSM execution unit 114 assigns a weight to each term based on the frequency of the terms that represent each text message. The weights may be based on TF (Term Frequency) values and IDF (Inverse Document Frequency) values. Calculating the TF value means counting the total frequency of the terms that appear within a specific text message, and calculating the IDF value means counting the total frequency of a specific term across the entire set of text messages. By combining the two, the VSM execution unit 114 prevents a high weight from being assigned to terms that appear commonly in many text messages.
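A small TF-IDF sketch shows how common terms end up with low weights. It assumes the widely used formulation in which IDF is the logarithm of the number of documents divided by the number of documents containing the term; the text does not state the exact formula, so this choice is an assumption.

```python
import math

def tf_idf(counts: list[list[int]]) -> list[list[float]]:
    """Weight a term-by-document count matrix: weight = tf * log(N / df)."""
    n_docs = len(counts[0])
    weighted = []
    for row in counts:
        df = sum(1 for c in row if c > 0)             # documents containing the term
        idf = math.log(n_docs / df) if df else 0.0    # assumed IDF formulation
        weighted.append([c * idf for c in row])
    return weighted

counts = [[0, 1], [1, 1], [2, 0]]   # rows: child, stress, work
for row in tf_idf(counts):
    print(row)
```

In this example the term that occurs in both documents receives a weight of zero, while the term concentrated in one document is weighted most heavily.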

Once a weight has been assigned to each term, the VSM execution unit 114 normalizes each text message. All text messages in the matrix X_t,d of FIG. 3 should carry the same significance, so the VSM execution unit 114 performs normalization on each text message.

To normalize the text messages, the VSM execution unit 114 may calculate the normalized value of each weighted term using Equation 1 below.

Here, w_id denotes the weighted term value for each term-document combination, and n denotes the number of documents.
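One way to give every message the same significance is to scale each document vector to unit length. The sketch below assumes this L2 normalization over the terms of each document; it is an illustrative choice and may differ from the document's Equation 1, in particular in what the sum runs over.

```python
import math

def normalize_columns(weights: list[list[float]]) -> list[list[float]]:
    """Scale each document (column) of a term-by-document matrix to unit length."""
    n_terms, n_docs = len(weights), len(weights[0])
    out = [[0.0] * n_docs for _ in range(n_terms)]
    for d in range(n_docs):
        norm = math.sqrt(sum(weights[t][d] ** 2 for t in range(n_terms)))
        if norm > 0:                      # leave all-zero documents untouched
            for t in range(n_terms):
                out[t][d] = weights[t][d] / norm
    return out

print(normalize_columns([[3.0, 0.0], [4.0, 0.0]]))
```

After this step a long message and a short message contribute vectors of equal length, so neither dominates the later decomposition.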

The topic extraction unit 120 performs topic analysis to extract keywords from the text messages preprocessed by the data collection and preprocessing unit 110. That is, the topic extraction unit 120 analyzes the meaning contained in the text messages to extract the messages related to the target, and from those messages extracts as keywords the terms whose weight is equal to or greater than a reference value.

For example, to analyze the stress of women related to life events, the topic extraction unit 120 may identify the patterns and meanings of the semantic constructions hidden in the text of women's online activities, and may extract keywords according to the recurring patterns of co-occurring words in order to understand women's stress.

Once the VSM execution unit 114 has assigned weights to the terms, the topic extraction unit 120 runs a latent semantic analysis (LSA) algorithm to extract the main topics from those terms. That is, the topic extraction unit 120 classifies the documents in the document-term space to reveal the semantic structures of the corpus, and on that basis extracts the topics (that is, the latent meanings) contained in the documents.

The topic extraction unit 120 also removes the parts of a document that contain no meaningful elements, so that documents can be identified and classified easily on the basis of the topics.

A generalized model of the content of the documents can be expressed as follows.

X = ΛF + ε

Here, X denotes the corpus of documents, ΛF denotes a linear combination of the topics in the large body of text data, and ε denotes the error term.

The latent semantic analysis (LSA) algorithm performed by the topic extraction unit 120 includes singular value decomposition (SVD) and dimensionality reduction.

As shown in FIG. 4, the singular value decomposition (SVD) process mathematically decomposes the matrix X_t,d into the product of three matrices: U_t,k, Σ_k,k, and V_k,d. FIG. 4 is a first reference diagram for explaining the function of the topic extraction unit provided in the target analysis apparatus.

FIG. 4 is a diagram schematically showing the singular value decomposition (SVD) procedure. In FIG. 4, U_t,k denotes the term-by-factor matrix containing the eigenvectors of the matrix XX^T. Σ_k,k denotes the term-by-term covariance matrix representing the latent semantic factors of the terms; it can be written as a diagonal matrix of singular values ranked in decreasing order. V_k,d denotes the factor-by-document matrix containing the eigenvectors of X^TX, where X^TX is the document-by-document covariance matrix showing the loading of each document on each factor. Document loadings indicate how strongly a document relates to the identified topics.

Singular value decomposition (SVD) is expressed in mathematical notation as follows.

X = UΣV^T

Here, U denotes the eigenvectors of the terms and V denotes the eigenvectors of the documents. Σ denotes the diagonal matrix of singular values, which are the square roots of the eigenvalues, and T denotes transposition.
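The link between the three factors and the eigenvectors mentioned for XX^T and X^TX can be made explicit. Assuming the usual SVD form with orthonormal columns in U and V and a square diagonal Σ (a standard property, not a formula quoted from this document), multiplying X by its transpose gives:

$$
XX^{T} = U\Sigma V^{T}V\Sigma U^{T} = U\Sigma^{2}U^{T}, \qquad X^{T}X = V\Sigma U^{T}U\Sigma V^{T} = V\Sigma^{2}V^{T}
$$

So the columns of U are eigenvectors of XX^T, the columns of V are eigenvectors of X^TX, and each singular value on the diagonal of Σ is the square root of an eigenvalue shared by the two products.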

The topic extraction unit 120 can use TSVD (Truncated SVD) to revise the term frequencies so that only the most important terms are retained. The TSVD process closely resembles principal component analysis (PCA) and involves an orthogonal transformation that converts correlated factors into a set of principal components.

The TSVD model rests on the observation that, in a large corpus of text data, topics are organized in a low-dimensional space, and it therefore involves the dimensionality reduction of X_t,d. The threshold for the eigenvalues is generally set to 1 or more. TSVD effectively addresses the synonymy and polysemy of words and has a considerable effect on query performance. TSVD is expressed in mathematical notation as follows.

X̂_t,d = U_t,k Σ_k,k V^T_k,d

Here, X̂_t,d denotes the term-by-document matrix, and U_t,k denotes the truncated term-by-factor matrix. Σ_k,k denotes the truncated factor-by-factor matrix solution, and V^T_k,d denotes the truncated factor-by-document matrix. k denotes the rank, that is, the number of singular values retained.

FIG. 5 is a second reference diagram for explaining the function of the topic extraction unit provided in the target analysis apparatus. FIG. 5 illustrates the operation of TSVD within the latent semantic analysis (LSA) algorithm. In FIG. 5, reference numerals 210 to 230 indicate the portions truncated from U_t,k, Σ_k,k, and V^T_k,d, respectively.

The orthonormality conditions U^TU = I and V^TV = I (where I is the k * k identity matrix) are used to obtain the term loadings (L_T) and the document loadings (L_D). This can be expressed as follows.

Here, L_T denotes the term-by-factor matrix representing the association between terms and latent topics, and L_D denotes the document-by-factor matrix representing the relationship between documents and latent topics.
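A common convention in the LSA literature, offered here as an assumption rather than as this document's own formula, scales the truncated singular vectors by the singular values:

$$
L_{T} = U_{t,k}\,\Sigma_{k,k}, \qquad L_{D} = V_{d,k}\,\Sigma_{k,k}
$$

Under this convention L_T has one row per term and one column per factor, L_D has one row per document and one column per factor, and V_{d,k} is the transpose of V^T_{k,d}. Because U^TU = I and V^TV = I, the same loadings can also be written as L_T = X̂V_{d,k} and L_D = X̂^T U_{t,k}.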

Once the optimal number of dimensions has been determined and the text data has been condensed through TSVD, the topic extraction unit 120 performs post-LSA quantitative analysis using varimax rotations to improve the interpretability of the initial results. FIG. 6 shows the term loadings and document loadings in the articulated factor space, together with the term-rotated and document-rotated loading operations and their outputs.

In the present invention, serialized topic models exist. Each topic consists of a set of distributions over terms (keywords). All terms in a document are ranked according to their share of the distribution. A term topic weight is assigned for each topic: when 25 topics are extracted, 25 term topic weights are calculated for each term. Because each topic is an SVD dimension, the term topic weights are the coordinates of the term in SVD space. FIG. 7 is an example showing the topics extracted in relation to the stressors of women in their 30s, together with the weights of those topics.

In FIG. 7, the term cutoff is the threshold score that determines whether a term belongs to a topic, and the document cutoff is the threshold score that determines whether a document belongs to a topic. A term or document is assigned to a topic (topic labeling) when its term topic weight or document topic weight exceeds the corresponding cutoff.
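The cutoff rule can be sketched as a simple threshold test. The weights, the cutoff value, and the function name `label_topics` below are hypothetical; the same function applies to terms with a term cutoff and to documents with a document cutoff.

```python
def label_topics(weights: dict[str, list[float]], cutoff: float) -> dict[int, list[str]]:
    """Assign each item to every topic whose weight for it exceeds the cutoff."""
    n_topics = len(next(iter(weights.values())))
    topics: dict[int, list[str]] = {k: [] for k in range(n_topics)}
    for item, topic_weights in weights.items():
        for k, w in enumerate(topic_weights):
            if w > cutoff:            # strictly greater than the cutoff
                topics[k].append(item)
    return topics

term_weights = {"childcare": [0.8, 0.1], "salary": [0.2, 0.7], "today": [0.05, 0.02]}
print(label_topics(term_weights, cutoff=0.5))
```

An item below the cutoff for every topic, such as "today" in this example, is left unlabeled.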

The topic-based network analysis unit 130 performs topic-based social network analysis (topic-based SNA) to visualize targets by age. It also identifies the relevance of topics to one another, based on a similarity measure computed over the relational attributes between topics.

Social network analysis (SNA) is the process of investigating social structures using networks and graph theory. It characterizes network structures in terms of nodes (individual actors, people, or things within the network) and the ties, edges, or links (relationships or interactions) that connect them. Social network analysis focuses on the structure of ties within a set of social actors, such as people, groups, organizations, countries, or others engaged in human activity.

To explain the targets (e.g., women's stressors), the topic-based network analysis unit 130 performs topic-based social network analysis (topic-based SNA) that focuses on the structure of the relationships within a set of stressors. The topic-based network analysis unit 130 may visualize the topic-based network so that the relationships between targets can be explored and the overall patterns of the targets (e.g., stressors) can be discovered.

As a useful tool for refining and arranging topics, the topic-based network analysis unit 130 performs three analytic procedures in sequence.

In the first analysis procedure, the topic-based network analysis unit 130 computes cosine similarity coefficients, so that the network drawn from them (in the third procedure) can be used to discover and visualize overall patterns.

To this end, the topic-based network analysis unit 130 calculates cosine similarities to determine the similarity between topics T_x and T_y. This can be expressed as follows.

Here, tft denotes the frequency of a term (keyword) in a topic. For each topic, the similarity is calculated between that topic in one sample and the other topics in the other samples.
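Cosine similarity between two topics can be sketched by treating each topic as a vector of keyword frequencies. The standard definition is used here (dot product divided by the product of the vector lengths); the keyword counts are made up for illustration.

```python
import math

def cosine_similarity(tx: dict[str, float], ty: dict[str, float]) -> float:
    """Cosine similarity between two topics given as keyword -> frequency maps."""
    dot = sum(tx[t] * ty[t] for t in tx.keys() & ty.keys())   # shared keywords only
    norm_x = math.sqrt(sum(v * v for v in tx.values()))
    norm_y = math.sqrt(sum(v * v for v in ty.values()))
    if norm_x == 0 or norm_y == 0:
        return 0.0                                            # empty topic: no similarity
    return dot / (norm_x * norm_y)

topic_x = {"childcare": 3, "school": 1}
topic_y = {"childcare": 2, "salary": 2}
print(cosine_similarity(topic_x, topic_y))
```

Topics that share no keywords score 0, and topics with proportional keyword frequencies score 1.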

In the second analysis procedure, the topic-based network analysis unit 130 performs topic-based social network analysis in which the network is represented by nodes (the set of serialized topics) and links (the set of degrees between nodes). Through this, the topic-based network analysis unit 130 can generate a sociomatrix X(N, N). The sociomatrix X(N, N) is the basic matrix type for social network data, and can be generated in N * N form using the daily topics and the cosine similarities. FIG. 8 is an example of a sociomatrix for topic-based social network analysis.
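An N * N sociomatrix can be sketched by filling each cell with the cosine similarity of the corresponding pair of topics. Setting the diagonal to zero (no self-links) is an assumption made here, as are the sample vectors.

```python
import math

def sociomatrix(topic_vectors: list[list[float]]) -> list[list[float]]:
    """Build an N x N matrix of pairwise cosine similarities between topics."""
    def cos(u, v):
        nu = math.sqrt(sum(a * a for a in u))
        nv = math.sqrt(sum(b * b for b in v))
        return sum(a * b for a, b in zip(u, v)) / (nu * nv) if nu and nv else 0.0

    n = len(topic_vectors)
    # Diagonal is set to 0.0 so that a topic is not linked to itself (assumption).
    return [[0.0 if i == j else cos(topic_vectors[i], topic_vectors[j])
             for j in range(n)] for i in range(n)]

for row in sociomatrix([[1, 0], [1, 0], [0, 1]]):
    print(row)
```

The resulting matrix is symmetric, so it describes an undirected weighted network of topics.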

In the third analysis procedure, the topic-based network analysis unit 130 constructs a weighted network. The weighted network is built from the weighted degree centrality of each node. This value is defined as the sum of the weights of the links connected to the node, and in the present invention it is used as the node size.

The topic-based network analysis unit 130 may calculate the weighted degree centrality of a node using the following equation.

Here, w denotes the weighted adjacency matrix, and w_ij has a value greater than 0 when node i is connected to node j.
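Since the weighted degree centrality is defined as the sum of the weights of a node's links, it can be illustrated with a short Python sketch. The function name and the sample weights are hypothetical, and it is assumed here that the weighted adjacency matrix is given as a list of lists and that self-links are ignored.

```python
def weighted_degree_centrality(w):
    """For each node i, sum the weights w[i][j] of its links to the other nodes j."""
    n = len(w)
    return [sum(w[i][j] for j in range(n) if j != i) for i in range(n)]


# Hypothetical weighted adjacency matrix: w[i][j] > 0 only when i and j are linked.
w = [
    [0.0, 0.5, 0.0],
    [0.5, 0.0, 0.25],
    [0.0, 0.25, 0.0],
]
centrality = weighted_degree_centrality(w)
```

In this example the middle node has the largest centrality and would be drawn as the largest node.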

When performing the topic-based social network analysis, the topic-based network analysis unit 130 may carry out the above functions by visualizing the topics by age group. This is described below.

FIG. 9 is a table of stress factors by age group, labeled as topics. Based on the topics (stress factors) arranged in FIG. 9, the topic-based network analysis unit 130 may arrange the nodes (topics) according to a predetermined criterion and visualize them as shown in FIG. 10. To analyze the structure of the topics related to stress factors, the topic-based network analysis unit 130 may calculate cosine similarities based on Equation 5 and then generate a sociomatrix to be used as input data for the topic-based social network analysis.

When visualizing the nodes as shown in FIG. 10, the topic-based network analysis unit 130 may arrange the nodes of each age group on a different axis. For example, the topic-based network analysis unit 130 may arrange the topics related to people in their 20s, 30s, 40s and 50s on the Y+ axis, X+ axis, Y- axis and X- axis, respectively.

The topic-based network analysis unit 130 may apply the same color to the same topic and adjust the node size according to the weight of each topic (e.g., stress level), thereby visibly displaying the targets (e.g., stress factors) of each age group on the graph. The topic-based network analysis unit 130 may also arrange the nodes in year order with 0 as the starting point. In addition, the topic-based network analysis unit 130 may connect nodes having the same color to visualize at which ages a specific target appears frequently, that is, its relationship with other age groups.
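The axis-per-age-group arrangement could be sketched in Python as below. This is only an illustration under stated assumptions: the node tuple layout, the function name, the unit spacing between nodes, and the sample nodes are hypothetical, and only the placement step is shown (coloring and sizing are left out).

```python
# Direction of the axis assigned to each age group: 20s on Y+, 30s on X+,
# 40s on Y-, 50s on X-.
AXIS_DIRECTION = {20: (0, 1), 30: (1, 0), 40: (0, -1), 50: (-1, 0)}


def layout_nodes(nodes):
    """Place (age_group, year, topic) nodes on their axis, in year order from the origin."""
    positions = {}
    for group, (dx, dy) in AXIS_DIRECTION.items():
        on_axis = sorted((n for n in nodes if n[0] == group), key=lambda n: n[1])
        for step, node in enumerate(on_axis, start=1):
            positions[node] = (dx * step, dy * step)
    return positions


# Hypothetical nodes.
nodes = [(30, 2017, "work"), (20, 2018, "love"), (20, 2016, "exam")]
positions = layout_nodes(nodes)
```

Earlier years end up closer to the origin, so reading outward along an axis follows the year order for that age group.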

As shown in FIGS. 11 and 12, the topic-based network analysis unit 130 may also visualize targets (e.g., stress factors) by arranging the nodes related to a specific age group in a circle in year order. FIGS. 11 and 12 visualize the stress factors related to people in their 30s and 40s, respectively, arranged in a circle in year order.

(a) of FIG. 11 shows all the stress factors related to people in their 30s, and (b) of FIG. 11 shows only those stress factors, among all the stress factors related to people in their 30s, whose weight is greater than or equal to a threshold value. Likewise, (a) of FIG. 12 shows all the stress factors related to people in their 40s, and (b) of FIG. 12 shows only those whose weight is greater than or equal to the threshold value.

In FIGS. 11 and 12, a node may be written in the form age_year_detailed topic. For example, if work life balance is a topic for 35-year-olds in 2018, this node may be written as 35_2018_work life balance in (a) and (b) of FIG. 11.
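The node naming scheme can be illustrated with a small Python sketch. The helper names are hypothetical. Because the detailed topic may itself contain spaces, the parser below splits only on the first two underscores.

```python
def node_label(age, year, topic):
    """Build a node label of the form age_year_detailed topic."""
    return f"{age}_{year}_{topic}"


def parse_node_label(label):
    """Split a node label back into (age, year, topic)."""
    age, year, topic = label.split("_", 2)
    return int(age), int(year), topic


label = node_label(35, 2018, "work life balance")
```

Here `label` is `35_2018_work life balance`, matching the example in the text.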

According to (a) of FIG. 11, family (27.27%) was the most frequent stress factor for people in their 30s, followed by work (24.24%) and marriage (15.15%). This shows that the stress factors of people in their 30s relate to forming a new family through marriage and to working in order to raise children and support parents. Love as a stress factor fell sharply from 40.91% in the 20s to 3.03% in the 30s, which indicates that, as people reach marriageable age in their 30s, stress related to marriage rather than dating becomes the main stress.

According to (b) of FIG. 11, node 1 (14_4_mother in law), node 2 (16_5_parenting) and node 3 (09_3_work life balance) are interconnected. This indicates that women in their 30s find it difficult to maintain a work-life balance while combining work and life (child rearing) after marriage, and that the conflicts arising when family members such as a mother-in-law raise the children in their place are a major stress factor for people in their 30s.

An embodiment of the present invention has been described above with reference to FIGS. 1 to 12. A preferred form of the present invention that can be inferred from this embodiment is described below.

FIG. 13 is a conceptual diagram schematically showing the internal configuration of a target analysis device according to a preferred embodiment of the present invention.

Referring to FIG. 13, the target analysis device 300 includes a message processing unit 310, a keyword detection unit 320, a target derivation unit 330, a node display control unit 340, a power supply unit 350, and a main control unit 360.

The power supply unit 350 supplies power to each component of the target analysis device 300.

The main control unit 360 controls the overall operation of each component of the target analysis device 300.

The message processing unit 310 collects and preprocesses text messages related to an event. The message processing unit 310 corresponds to the data collection and preprocessing unit 110 of FIG. 1.

The message processing unit 310 may preprocess the text messages by extracting meaning-bearing words from the sentences included in the text messages, assigning weights to these words, and normalizing the weighted words.

The message processing unit 310 may extract the words from the sentences using tokenization, which identifies the part of speech of each word and tags it.

The message processing unit 310 may assign the weights to the words based on a first frequency, with which each word appears in a single text message, and a second frequency, with which each word appears across all the text messages.
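One common way to combine a per-message frequency with a collection-wide frequency is a TF-IDF-style weight, sketched below in Python. This is an assumption for illustration only: the passage does not give the weighting formula, the second frequency is interpreted here as the number of messages containing the word, and the function name and sample messages are hypothetical.

```python
import math
from collections import Counter


def tfidf_weights(messages):
    """messages: list of token lists. Returns one {term: weight} dict per message."""
    n = len(messages)
    # Second frequency (assumed): number of messages in which each term appears.
    df = Counter()
    for tokens in messages:
        df.update(set(tokens))
    weights = []
    for tokens in messages:
        # First frequency: how often each term appears in this one message.
        tf = Counter(tokens)
        weights.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return weights


# Hypothetical tokenized messages.
messages = [["work", "work", "boss"], ["work", "family"]]
weights = tfidf_weights(messages)
```

Under this scheme a word that appears in every message, such as "work" above, receives a weight of 0, while words specific to one message are weighted up.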

The message processing unit 310 may normalize the words based on the weight assigned to each word and the number of text messages.

The keyword detection unit 320 detects topics as keywords from documents related to the text messages, based on the weights assigned to the meaning-bearing terms included in the text messages. The keyword detection unit 320 corresponds to the topic extraction unit 120 of FIG. 1.

The keyword detection unit 320 may detect the topics as keywords based on singular value decomposition (SVD), which uses a first matrix related to the relationships among the words, a second matrix related to the relationships among the documents, and a third matrix related to the relationships between the words and the documents, and on truncated SVD (TSVD), which uses the first frequency of each word.

The keyword detection unit 320 may detect the topics among the words based on a cutoff related to the weights.

The keyword detection unit 320 may sequentially apply SVD, TSVD and varimax rotation to the documents, and then detect the topics by labeling them with predetermined information based on the cutoff.
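The final cutoff step might look like the following Python sketch. It assumes that SVD, truncation and varimax rotation have already produced a topic-by-term loading matrix (those steps are not shown), and that a term belongs to a topic when the absolute value of its loading reaches the cutoff. The function name, the loading values and the cutoff of 0.5 are hypothetical.

```python
def terms_above_cutoff(loadings, terms, cutoff):
    """loadings[k][i] is the loading of terms[i] on topic k.

    Returns, for each topic, the terms whose absolute loading is at least
    the cutoff, strongest first.
    """
    result = []
    for row in loadings:
        picked = [(abs(v), t) for v, t in zip(row, terms) if abs(v) >= cutoff]
        picked.sort(reverse=True)
        result.append([t for _, t in picked])
    return result


# Hypothetical rotated loadings for two topics over four terms.
terms = ["work", "boss", "family", "child"]
loadings = [
    [0.8, 0.6, 0.1, 0.0],
    [0.05, -0.1, 0.7, 0.9],
]
topic_terms = terms_above_cutoff(loadings, terms, cutoff=0.5)
```

The surviving high-loading terms for each topic are what a person would then read in order to attach a label such as "work" or "family" to the topic.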

The target derivation unit 330 derives targets by age group by executing a social network analysis on the topics. The target derivation unit 330 corresponds to the topic-based network analysis unit 130 of FIG. 1.

The target derivation unit 330 may calculate a cosine similarity, which relates to the similarity between the topics, based on the first frequency of the topics, and may derive the targets by executing the social network analysis based on this cosine similarity.

The target derivation unit 330 may generate a sociomatrix based on the cosine similarity, which relates to the similarity between the topics, and on relationship information between the topics, and may derive the targets by executing the social network analysis based on this sociomatrix.

The target analysis device 300 may further include a node display control unit 340.

The node display control unit 340 visibly displays the nodes by arranging, on each axis of a graph, the nodes related to the targets of each age group in year order. The node display control unit 340 corresponds to the topic-based network analysis unit 130 of FIG. 1.

The node display control unit 340 may visibly display the nodes on the graph by applying nodes of the same color to the same target and adjusting the size of each node based on the weight related to each topic.

Next, a method of operating the target analysis device 300 is described.

First, the message processing unit 310 collects and preprocesses text messages related to an event (STEP A).

Then, the keyword detection unit 320 detects topics as keywords from documents related to the text messages, based on the weights assigned to the meaning-bearing terms included in the text messages (STEP B).

Then, the target derivation unit 330 derives targets by age group by executing a social network analysis on the topics (STEP C).

Meanwhile, after STEP C, the node display control unit 340 may visibly display the nodes by arranging, on each axis of a graph, the nodes related to the targets of each age group in year order (STEP D).

Even though all the components constituting the embodiments of the present invention described above are described as being combined into one or as operating in combination, the present invention is not necessarily limited to these embodiments. That is, within the scope of the object of the present invention, one or more of the components may be selectively combined and operated. In addition, although each of the components may be implemented as an independent piece of hardware, some or all of the components may be selectively combined and implemented as a computer program having program modules that perform some or all of the combined functions on one or more pieces of hardware. Such a computer program may be stored in a computer-readable medium such as a USB memory, a CD, or a flash memory, and read and executed by a computer, thereby implementing the embodiments of the present invention. The recording medium of the computer program may include a magnetic recording medium, an optical recording medium, and the like.

In addition, all terms, including technical and scientific terms, have the same meaning as commonly understood by a person of ordinary skill in the art to which the present invention belongs, unless otherwise defined in the detailed description. Commonly used terms, such as those defined in dictionaries, should be interpreted as consistent with their meaning in the context of the related art, and are not to be interpreted in an idealized or excessively formal sense unless explicitly so defined in the present invention.

The above description merely illustrates the technical idea of the present invention, and a person of ordinary skill in the art to which the present invention belongs may make various modifications, changes and substitutions without departing from the essential characteristics of the present invention. Accordingly, the embodiments disclosed herein and the accompanying drawings are intended to explain, not to limit, the technical idea of the present invention, and the scope of that technical idea is not limited by these embodiments or the accompanying drawings. The scope of protection of the present invention should be interpreted according to the claims below, and all technical ideas within an equivalent scope should be construed as falling within the scope of the present invention.

Claims

A target analysis device comprising:
a message processing unit that collects and preprocesses text messages related to an event;
a keyword detection unit that detects topics as keywords from documents related to the text messages, based on weights assigned to meaning-bearing terms included in the text messages; and
a target derivation unit that derives targets by age by executing a social network analysis on the topics,
wherein the message processing unit preprocesses the text messages by extracting the meaning-bearing words from sentences included in the text messages, assigning weights to the words, and normalizing the weighted words.

delete

The target analysis device of claim 1,
wherein the message processing unit extracts the words from the sentences using tokenization, which identifies and tags the part of speech of each word.

The target analysis device of claim 1,
wherein the message processing unit assigns the weights to the words based on a first frequency of each word appearing in one text message and a second frequency of each word appearing in all of the text messages.

The target analysis device of claim 1,
wherein the message processing unit normalizes the words based on the weight assigned to each word and the number of the text messages.

The target analysis device of claim 1,
wherein the keyword detection unit detects the topics as keywords based on singular value decomposition (SVD), which uses a first matrix related to the relationships among the words, a second matrix related to the relationships among the documents, and a third matrix related to the relationships between the words and the documents, and on truncated SVD (TSVD), which uses the first frequency of each word.

The target analysis device of claim 1,
wherein the keyword detection unit detects the topics among the words based on a cutoff related to the weights.

The target analysis device of claim 7,
wherein the keyword detection unit sequentially applies SVD, TSVD and varimax rotation to the documents, and then detects the topics by labeling them with predetermined information based on the cutoff.

The target analysis device of claim 1,
wherein the target derivation unit calculates a cosine similarity related to the similarity between the topics based on a first frequency of the topics, and derives the targets by executing the social network analysis based on the cosine similarity.

The target analysis device of claim 1,
wherein the target derivation unit generates a sociomatrix based on a cosine similarity related to the similarity between the topics and on relationship information between the topics, and derives the targets by executing the social network analysis based on the sociomatrix.

The target analysis device of claim 1,
further comprising a node display control unit that visibly displays nodes by arranging, on each axis of a graph, the nodes related to the targets of each age in year order.

The target analysis device of claim 11,
wherein the node display control unit visibly displays the nodes on the graph by applying nodes of the same color to the same target and adjusting the size of each node based on a weight related to each topic.

A target analysis method comprising:
collecting and preprocessing, by a message processing unit, text messages related to an event;
detecting, by a keyword detection unit, topics as keywords from documents related to the text messages, based on weights assigned to meaning-bearing terms included in the text messages; and
deriving, by a target derivation unit, targets by age by executing a social network analysis on the topics,
wherein, in the preprocessing, the text messages are preprocessed by extracting the meaning-bearing words from sentences included in the text messages, assigning weights to the words, and normalizing the weighted words.

delete

The method of claim 13,
wherein, in the deriving, a cosine similarity related to the similarity between the topics is calculated based on a first frequency of the topics, and the targets are derived by executing the social network analysis based on the cosine similarity.

The method of claim 13,
further comprising visibly displaying, by a node display control unit, nodes by arranging, on each axis of a graph, the nodes related to the targets of each age in year order.

A computer program stored in a computer-readable medium for executing the target analysis method according to any one of claims 13, 15 and 16 on a computer.