KR102492881B1

KR102492881B1 - Classification Dictionary Building Method Per Subject of Text Data and Storage Medium Recording Program for Executing the Same

Info

Publication number: KR102492881B1
Application number: KR1020220124944A
Authority: KR
Inventors: 류승완
Original assignee: 주식회사 알에스엔
Priority date: 2020-11-26
Filing date: 2022-09-30
Publication date: 2023-01-30
Also published as: KR20220073498A; KR20220137603A; KR102463216B1

Abstract

A method of building a subject-specific classification dictionary for text data according to the present invention is performed by a management server on which a program for building a subject-specific classification dictionary for text data, stored in a storage medium, is installed. The method includes: step (a) of extracting text data from an online service; step (b) of dividing the sentences included in the text data extracted in step (a) into preset word units and extracting them; step (c) of preprocessing the words divided in step (b); step (d) of deriving a subject name for the text data by analyzing the words preprocessed in step (c); and step (e) of storing the process of steps (a) through (d) as an initial classification system.

Description

Classification Dictionary Building Method Per Subject of Text Data and Storage Medium Recording Program for Executing the Same

The present invention relates to a method of building a subject-specific classification dictionary for text data and to a computer-readable storage medium recording a program for executing the method. More particularly, it relates to a method, and a corresponding storage medium, that facilitate dictionary building by recommending classification patterns highly relevant to the subject of the text data.

Recently, as the amount of data available online has become vast, building classification dictionaries by text mining large volumes of text has become increasingly important.

Conventionally, a classification dictionary was built by a person who sampled large-volume text data, read it from beginning to end, and extracted the words relevant to data classification one by one.

However, building a classification dictionary by hand in this way incurs high labor costs if the vast amount of data is to be fully reflected, and it yields entirely different dictionaries depending on the subjective judgment of the worker.

In addition, patterns that do not fit the subject are likely to be included in the classification dictionary, which produces an incorrect classification system and greatly reduces the accuracy of data classification.

Furthermore, there is a clear limit to how much of the exponentially growing volume of recent data a person can review, so it is difficult to reflect the latest trends in the classification dictionary.

A method for solving these problems is therefore required.

Korean Patent Publication No. 10-2014-0010930

The present invention has been made to solve the problems of the prior art described above. Its object is to provide a method of building a subject-specific classification dictionary for text data that facilitates dictionary building by recommending classification patterns highly relevant to the subject of the text data.

The objects of the present invention are not limited to those mentioned above, and other objects not mentioned will be clearly understood by those skilled in the art from the description below.

To achieve the above object, the method of building a subject-specific classification dictionary for text data according to the present invention is performed by a management server on which a program for building a subject-specific classification dictionary for text data, stored in a storage medium, is installed. The method includes: step (a) of extracting text data from an online service; step (b) of dividing the sentences included in the text data extracted in step (a) into preset word units and extracting them; step (c) of preprocessing the words divided in step (b); step (d) of deriving a subject name for the text data by analyzing the words preprocessed in step (c); and step (e) of storing the process of steps (a) through (d) as an initial classification system.

Step (b) may include step (b-1) of dividing the sentences included in the text data by eojeol (space-delimited word unit), and step (b-2) of separating and extracting, from the eojeol divided in step (b-1), only the words belonging to the noun group, the verb group, and the adjective group.

Step (c) may include step (c-1) of forming combination groups by combining n of the words extracted in step (b), step (c-2) of extracting the subject index of each word included in a combination group, and step (c-3) of assigning a weight to each word included in the combination group.

In step (c-3), when a combination group includes both a noun and an adjective, an additional weight may be assigned to the noun of that combination group.

Alternatively, in step (c-3), when an adjective included in a combination group is one of a preset list of emphasis adjectives, an additional weight may be assigned to that adjective.

Step (d) may include step (d-1) of calculating the number of times a given word preprocessed in step (c) occurs in the text data, step (d-2) of calculating the ratio of the words constituting the text data that indicate the same subject as the given word of step (d-1), step (d-3) of calculating a subject index by multiplying the results of steps (d-1) and (d-2), and step (d-4) of deriving the subject name of the text data in consideration of the subject index of step (d-3).

The present invention may further include step (f) of updating to an extended classification system by reflecting a new classification system derived from steps (a) through (d) performed anew after the initial classification system has been stored.

Step (f) may include step (f-1) of measuring the similarity between the subject names of the text data included in the initial classification system and the subject names of the text data included in the new classification system, and step (f-2) of updating the extended classification system by merging the subject names of the new classification system with, or keeping them separate from, the subject names of the initial classification system according to the similarity measured in step (f-1).

In this case, step (f-1) may determine the similarity between the subject names of the text data included in the initial classification system and the subject names of the text data included in the new classification system using a predetermined formula.

[Formula image not reproduced in this text.]

In step (f-2), the extended classification system may be updated by merging and storing together the subject names whose similarity determined in step (f-1) has an absolute value of 0.5 or more, and by recognizing the subject names whose similarity has an absolute value of less than 0.5 as new classifications and storing them separately.

In addition, in step (f-2), when a plurality of subject names have a similarity with an absolute value of 0.5 or more in step (f-1), only the subject name with the highest absolute value may be merged and stored.

Meanwhile, the method of building a subject-specific classification dictionary for text data according to the present invention may be provided in the form of a computer-readable storage medium on which a program for executing the method is recorded.

The method of building a subject-specific classification dictionary for text data according to the present invention, and the computer-readable storage medium recording a program for executing it, automate the conventional process in which a person reviews large-volume text data item by item to build a classification system. This reduces cost, allows the latest data to be reflected quickly, and makes it possible to build a dictionary with consistent classification criteria.

In particular, it was conventionally difficult to build dictionaries for many fields because a person had to review tens of thousands of data items in a single field over a long period. By enabling classification dictionaries to be built through automation, the present invention makes it possible to build more accurate classification dictionaries in a variety of fields and to greatly improve the productivity of data classification.

The effects of the present invention are not limited to those mentioned above, and other effects not mentioned will be clearly understood by those skilled in the art from the claims.

FIG. 1 is a diagram showing the overall process of a method of building a subject-specific classification dictionary for text data according to an embodiment of the present invention;
FIG. 2 is a diagram showing the detailed process of step (b) in the method according to an embodiment of the present invention;
FIG. 3 is a diagram showing the detailed process of step (c) in the method according to an embodiment of the present invention;
FIG. 4 is a diagram showing the detailed process of step (d) in the method according to an embodiment of the present invention; and
FIG. 5 is a diagram showing the detailed process of step (f) in the method according to an embodiment of the present invention.

Hereinafter, preferred embodiments of the present invention, in which the object of the invention can be concretely realized, are described with reference to the accompanying drawings. In describing the embodiments, the same names and reference numerals are used for the same components, and redundant descriptions are omitted.

The method of building a subject-specific classification dictionary for text data according to the present invention is performed by a management server equipped with a storage medium on which a program for building a subject-specific classification dictionary for text data is installed. The program is installed on the management server and may be run by the processor of the management server.

The running program may produce output through an image output device such as a display module, and may present visible information to the user through a graphical user interface.

In particular, the storage medium storing the program may be installed on the management server by means of a removable disk or a communication network, and the program may cause the management server to operate as various functional means. That is, in the present invention, information processing by software is concretely realized through hardware.

The algorithm of the method of building a subject-specific classification dictionary for text data, as executed by the management server, is described below.

FIG. 1 is a diagram showing the overall process of a method of building a subject-specific classification dictionary for text data according to an embodiment of the present invention.

As shown in FIG. 1, the method according to an embodiment of the present invention includes step (a) of extracting text data from an online service, step (b) of dividing the sentences included in the text data extracted in step (a) into preset word units and extracting them, step (c) of preprocessing the words divided in step (b), step (d) of deriving the subject name of the text data by analyzing the words preprocessed in step (c), and step (e) of storing the process of steps (a) through (d) as an initial classification system.

Step (a) is the process of extracting given text data from various online services.

Here, an online service may be any of various online service media that use a communication network, such as the web or social networking services (SNS), and is not limited to any one service. It may of course also include online services that appear in the future.

Step (b) is the process of dividing the sentences contained in the text data extracted in step (a) into preset word units and extracting them, and may itself include several sub-steps.

FIG. 2 is a diagram showing the detailed process of step (b) in the method according to an embodiment of the present invention.

As shown in FIG. 2, step (b) includes step (b-1) of dividing the sentences included in the text data by eojeol (space-delimited word unit), and step (b-2) of separating and extracting, from the eojeol divided in step (b-1), only the words belonging to the noun group, the verb group, and the adjective group.

In step (b-1), each eojeol (space-delimited word unit) in the sentences of the text data is split off. For example, if the text data contains the sentence '좋은 화장품은 피부를 촉촉하게 만든다.' ('Good cosmetics make the skin moist.'), it can be divided into the eojeol '좋은', '화장품은', '피부를', '촉촉하게', and '만든다.'.
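The split in step (b-1) can be sketched in Python as follows. This is a minimal illustration that assumes eojeol are separated by whitespace; the function name `split_into_eojeol` is hypothetical and not taken from the patent.

```python
def split_into_eojeol(sentence: str) -> list[str]:
    """Split a sentence into eojeol (space-delimited word units)."""
    # str.split() with no argument splits on any run of whitespace
    # and drops leading and trailing whitespace.
    return sentence.split()


sentence = "좋은 화장품은 피부를 촉촉하게 만든다."
eojeol = split_into_eojeol(sentence)
print(eojeol)
```

Note that punctuation stays attached to the last eojeol, as in the example sentence in the text.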

In step (b-2), only the words belonging to the noun group, the verb group, and the adjective group are separated and extracted from the eojeol divided in step (b-1). The noun group includes nouns and pronouns, which indicate a 'target'; the verb group includes verbs, which indicate an 'action'; and the adjective group includes adjectives, which indicate 'emphasis'. All remaining elements, such as particles, final endings, numerals, determiners, and adverbs, are removed.
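The filtering in step (b-2) might look like the sketch below. It assumes the eojeol have already been broken into (word, tag) pairs by some morphological analyzer; the tag names, the hand-assigned tags in the sample, and the function name `keep_content_words` are all hypothetical and serve only to illustrate the idea.

```python
# Hypothetical part-of-speech labels; a real analyzer would use its own tag set.
KEPT_TAGS = {"NOUN", "PRONOUN", "VERB", "ADJECTIVE"}


def keep_content_words(tagged):
    """Keep only words in the noun, verb, and adjective groups."""
    return [(word, tag) for word, tag in tagged if tag in KEPT_TAGS]


# Hand-tagged version of the example sentence (tags assigned for illustration).
tagged_sentence = [
    ("좋은", "ADJECTIVE"), ("화장품", "NOUN"), ("은", "PARTICLE"),
    ("피부", "NOUN"), ("를", "PARTICLE"), ("촉촉하게", "ADJECTIVE"),
    ("만든다", "VERB"), (".", "PUNCTUATION"),
]
content_words = keep_content_words(tagged_sentence)
print([word for word, _ in content_words])
```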

The reason for this is to make clear the targets that the sentences in the text data describe, and thereby to raise the accuracy of classification, that is, of the subject, based on the relationships between the words in each sentence.

Next, step (c) is the process of preprocessing the words divided in step (b), and may itself include several sub-steps.

FIG. 3 is a diagram showing the detailed process of step (c) in the method according to an embodiment of the present invention.

As shown in FIG. 3, step (c) includes step (c-1) of forming combination groups by combining n of the words extracted in step (b), step (c-2) of extracting the subject index of each word included in a combination group, and step (c-3) of assigning a weight to each word included in the combination group.

In step (c-1), combination groups are formed by combining n of the words extracted in step (b), that is, words belonging to the noun group, the verb group, and the adjective group. Here n may be a natural number of 1 or more; in this embodiment, combination groups are formed by combining one to three words.

Combination groups of more than three words are not formed in this embodiment because, as the amount of data grows, the subjects of the data could become too finely subdivided.

The combination groups that can be formed in this embodiment may be configured as shown in Table 1.

The criteria in Table 1 are combinations chosen to express the relationships between words well, with the focus on whether the meaning of the words in each combination can express the subject of the text data well. Combination groups are formed within a single sentence, and sentence boundaries are determined by final endings, periods, and line breaks.
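As a rough illustration of step (c-1), the sketch below enumerates every combination of one to three words from the extracted words of a single sentence. The admissible patterns of Table 1 are not available in this text, so emitting all combinations is an assumption; the function name `combination_groups` and the parameter `max_n` are likewise hypothetical.

```python
from itertools import combinations


def combination_groups(words, max_n=3):
    """Return all groups of 1..max_n words, keeping the original word order."""
    groups = []
    for n in range(1, min(max_n, len(words)) + 1):
        # combinations() yields tuples in the order the words appear.
        groups.extend(combinations(words, n))
    return groups


# Extracted words of one sentence (sample data for illustration).
words = ["좋은", "화장품", "피부"]
groups = combination_groups(words)
for group in groups:
    print(group)
```

For a sentence with k extracted words this produces C(k,1) + C(k,2) + C(k,3) groups, which grows quickly with k and is consistent with the text's reason for capping n at three.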

Meanwhile, if a machine learning algorithm uses text words as they are, processing is very slow and consumes a great deal of computer hardware resources. In this embodiment, therefore, each word is assigned an integer ID, the IDs are collected into a temporary word dictionary, and the large-volume data, already divided into words, is converted into integer word vectors.

An example of this may be configured as shown in Table 2.
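The integer substitution described above can be sketched as follows. The ID assignment order (first occurrence gets the next free integer) and the names `build_vocabulary` and `encode` are assumptions for illustration; the contents of Table 2 are not reproduced here.

```python
def build_vocabulary(sentences):
    """Build a temporary word dictionary mapping each word to an integer ID."""
    vocab = {}
    for sentence in sentences:
        for word in sentence:
            if word not in vocab:
                vocab[word] = len(vocab)  # next unused ID
    return vocab


def encode(sentences, vocab):
    """Replace every word with its integer ID."""
    return [[vocab[word] for word in sentence] for sentence in sentences]


# Sample corpus, already divided into words (illustrative data).
corpus = [["좋은", "화장품", "피부"], ["화장품", "피부", "만든다"]]
vocab = build_vocabulary(corpus)
encoded = encode(corpus, vocab)
print(vocab)
print(encoded)
```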

In step (c-2), the subject index of each word included in a combination group is extracted, and in step (c-3), a weight is assigned to the subject index of each word included in the combination group.

The weighting of the subject index may be based on various criteria.

In this embodiment, when a combination group includes both a noun and an adjective, an additional weight is assigned to the noun of that combination group. In particular, when an adjective included in the combination group is one of a preset list of emphasis adjectives, an additional weight is assigned to that adjective.

The reason is that a combination group containing an adjective together with a noun places emphasis on the noun, and an adjective that strongly conveys the subject in relation to a noun carries more of the subject's meaning than other adjectives.
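The two weighting rules of step (c-3) can be sketched as below. The patent does not state numeric weights, so `BASE_WEIGHT`, `NOUN_BONUS`, `EMPHASIS_BONUS`, the contents of `EMPHASIS_ADJECTIVES`, and the tag names are all hypothetical values chosen for illustration.

```python
BASE_WEIGHT = 1.0
NOUN_BONUS = 0.5                  # hypothetical extra weight for nouns
EMPHASIS_BONUS = 0.5              # hypothetical extra weight for emphasis adjectives
EMPHASIS_ADJECTIVES = {"좋은"}     # hypothetical preset list


def assign_weights(group):
    """Return a word -> weight mapping for one combination group.

    `group` is a sequence of (word, tag) pairs.
    """
    tags = {tag for _, tag in group}
    has_noun_and_adjective = "NOUN" in tags and "ADJECTIVE" in tags
    weights = {}
    for word, tag in group:
        weight = BASE_WEIGHT
        # Rule 1: nouns gain weight when the group also holds an adjective.
        if tag == "NOUN" and has_noun_and_adjective:
            weight += NOUN_BONUS
        # Rule 2: preset emphasis adjectives gain weight.
        if tag == "ADJECTIVE" and word in EMPHASIS_ADJECTIVES:
            weight += EMPHASIS_BONUS
        weights[word] = weight
    return weights


print(assign_weights([("좋은", "ADJECTIVE"), ("화장품", "NOUN")]))
print(assign_weights([("화장품", "NOUN"), ("만든다", "VERB")]))
```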

Next, in step (d), the words preprocessed in step (c) are analyzed to derive the subject name of the text data. Step (d) may likewise include several sub-steps.

FIG. 4 is a diagram showing the detailed process of step (d) in the method according to an embodiment of the present invention.

As shown in FIG. 4, step (d) includes step (d-1) of calculating the number of times a given word preprocessed in step (c) occurs in the text data, step (d-2) of calculating the ratio of the words constituting the text data that indicate the same subject as the given word of step (d-1), step (d-3) of calculating a subject index by multiplying the results of steps (d-1) and (d-2), and step (d-4) of deriving the subject name of the text data in consideration of the subject index of step (d-3).

The number of occurrences is calculated in step (d-1) because the more often a word is mentioned in a given piece of text data, the more likely it is to be an important pattern in that text data.

The ratio is calculated in step (d-2) because a word belonging to a subject that appears at a consistently high ratio in the text data is likely to be an important pattern in that text data.

Accordingly, in step (d-3) the results of steps (d-1) and (d-2) are multiplied to calculate the subject index, and patterns are recommended in descending order of subject index. In step (d-4), the subject name of the text data is derived in consideration of that subject index.

The per-word weights assigned in step (c-3) may be applied in each sub-step of step (d).
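One possible reading of steps (d-1) through (d-3) is sketched below. The text does not specify how a word is associated with a subject, so the `topic_of` mapping is a hypothetical input, and the ratio is taken here as the fraction of all words in the text that share the given word's subject. The function name `subject_indices` and the optional `weights` argument are also assumptions.

```python
from collections import Counter


def subject_indices(words, topic_of, weights=None):
    """Compute count x ratio (x optional weight) for every distinct word."""
    counts = Counter(words)                               # step (d-1)
    total = len(words)
    topic_counts = Counter(topic_of[w] for w in words)
    indices = {}
    for word, count in counts.items():
        ratio = topic_counts[topic_of[word]] / total      # step (d-2)
        weight = 1.0 if weights is None else weights.get(word, 1.0)
        indices[word] = count * ratio * weight            # step (d-3)
    return indices


# Illustrative data: words of one text and a hypothetical word -> subject map.
words = ["화장품", "피부", "화장품", "배송"]
topic_of = {"화장품": "beauty", "피부": "beauty", "배송": "delivery"}
indices = subject_indices(words, topic_of)
# Patterns recommended in descending order of subject index.
ranked = sorted(indices, key=indices.get, reverse=True)
print(indices)
print(ranked)
```

Multiplying the two quantities means a word ranks highly only when it is both frequent and aligned with a subject that dominates the text.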

The subject name of the text data is derived through the process described above, and in step (e) the process of steps (a) through (d) is stored as the initial classification system.

However, because the initial classification system cannot classify newly coined words or newly emerging subjects, it needs to be extended continually by collecting new data, generating a classification system from it, and adding that system to the initial one.

Accordingly, the method according to an embodiment of the present invention may further include step (f) of updating to an extended classification system by reflecting a new classification system derived from steps (a) through (d) performed anew after the initial classification system has been stored in step (e).

FIG. 5 is a diagram showing the detailed process of step (f) in the method according to an embodiment of the present invention.

As shown in FIG. 5, step (f) may include step (f-1) of measuring the similarity between the subject names of the text data included in the initial classification system and the subject names of the text data included in the new classification system, and step (f-2) of updating the extended classification system by merging the subject names of the new classification system with, or keeping them separate from, the subject names of the initial classification system according to the similarity measured in step (f-1).

In step (f-1), the similarity between the subject names in the initial classification system and the subject names in the new classification system is measured in order to extend the initial classification system.

Various methods may of course be used to measure this similarity.

In this embodiment, Equation 1 is applied as the method for determining similarity. With Equation 1, the closer the absolute value of the result is to 1, the higher the similarity is judged to be.
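Equation 1 itself is not shown in this text. As a stand-in, the sketch below uses the Pearson correlation coefficient, which is only an assumption: it is one well-known measure whose absolute value approaches 1 as two series become more strongly related. Representing each subject name as a numeric vector (for example, subject indices over a shared word list) is also a hypothetical choice.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_x == 0 or spread_y == 0:
        return 0.0  # undefined for a constant vector; treated as unrelated here
    return covariance / (spread_x * spread_y)


# Hypothetical vectors for an initial and a new subject name.
initial_vector = [1.0, 2.0, 3.0]
new_vector = [2.0, 4.0, 6.0]
similarity = pearson(initial_vector, new_vector)
print(round(similarity, 6))
```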

In step (f-2), the extended classification system may be updated by merging and storing together the subject names whose similarity determined in step (f-1) has an absolute value of 0.5 or more, and by recognizing the subject names whose similarity has an absolute value of less than 0.5 as new classifications and storing them separately.

In addition, when a plurality of subject names have a similarity with an absolute value of 0.5 or more in step (f-1), only the subject name with the highest absolute value may be merged and stored.

For example, when determining the similarity between initial classifications 1 and 2 and new classifications 1 and 2 yields a table such as Table 3 above, new classification 1 may be merged into initial classification 1, and new classification 2 may be merged into initial classification 2.

Here, new classification 2 also has an absolute value of 0.5 or more and thus satisfies the condition for merging with initial classification 1; however, since the absolute value of new classification 2 is higher than that of new classification 1, new classification 1 is merged with initial classification 1.
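The merge-or-separate rule of step (f-2) can be sketched as follows. This is only an illustrative sketch under stated assumptions: the function name `update_classification`, the dictionary layout, and the sample scores are hypothetical and do not come from the embodiment. The similarity scores are taken as already computed in step (f-1), and "only the highest absolute value is merged" is read as each new subject name being merged into the single initial subject name with which it has the highest absolute similarity.

```python
from typing import Dict, List

MERGE_THRESHOLD = 0.5  # threshold stated for step (f-2)


def update_classification(
    initial: Dict[str, List[str]],
    similarity: Dict[str, Dict[str, float]],
    threshold: float = MERGE_THRESHOLD,
) -> Dict[str, List[str]]:
    """Merge or separate new subject names (hypothetical helper).

    initial:    initial subject name -> new subject names merged into it
    similarity: similarity[new][init] = score from step (f-1), assumed precomputed
    """
    extended = {name: list(merged) for name, merged in initial.items()}
    for new_name, scores in similarity.items():
        # Initial subjects whose absolute similarity reaches the threshold.
        candidates = {
            init: abs(score)
            for init, score in scores.items()
            if init in initial and abs(score) >= threshold
        }
        if candidates:
            # Several candidates: merge only into the one with the highest |score|.
            best = max(candidates, key=candidates.get)
            extended[best].append(new_name)
        else:
            # No candidate: store the new subject as a separate classification.
            extended.setdefault(new_name, [])
    return extended


# Made-up scores for illustration only; these are NOT the values of Table 3.
scores = {
    "new_1": {"initial_1": 0.8, "initial_2": -0.6},
    "new_2": {"initial_1": 0.55, "initial_2": 0.9},
    "new_3": {"initial_1": 0.1, "initial_2": -0.3},
}
result = update_classification({"initial_1": [], "initial_2": []}, scores)
print(result)
```

With these made-up scores, `new_2` meets the threshold for both initial classifications but is merged only into the one with the higher absolute value, while `new_3` meets it for neither and is stored as a separate classification.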

Through the process described above, the present invention mechanically automates the conventional process in which a person checks large volumes of text data one by one to build a classification system, thereby reducing cost and making it possible to build a dictionary that promptly reflects the latest data and has consistent classification criteria.

Preferred embodiments of the present invention have been described above. It is apparent to those of ordinary skill in the art that, in addition to the embodiments described, the present invention may be embodied in other specific forms without departing from its spirit or scope. The embodiments described above should therefore be regarded as illustrative rather than restrictive, and accordingly the present invention is not limited to the above description but may be modified within the scope of the appended claims and their equivalents.

Claims

A subject-specific classification dictionary construction method for text data performed through a management server in which a subject-specific classification dictionary construction program for text data stored in a storage medium is installed,
(a) extracting text data from an online service;
(b) dividing and extracting sentences included in the text data extracted in step (a) into preset word units;
(c) pre-processing the words divided in step (b);
(d) deriving subject names of the text data by analyzing the words preprocessed by step (c);
(e) storing the process of steps (a) to (d) as an initial classification system; and
(f) updating to an extended classification system by reflecting a new classification system derived by newly performing steps (a) to (d) after the initial classification system is stored;
including,
In step (c),
(c-1) forming a combination group by combining n words (where n is a natural number equal to or greater than 1) extracted in step (b); and
(c-2) pre-processing the extracted words based on the combination group,
The combination group is formed in sentence units, with a sentence delimited by a final ending, a period, or a line break,
In the step (c-1), one to three words are combined to form the combination group,
In step (d),
(d-1) calculating the number of exposures of the words preprocessed by step (c) in the text data;
(d-2) calculating a ratio in which the words constituting the text data represent the same subject as the preprocessed words;
(d-3) calculating a subject index by multiplying the result values of steps (d-1) and (d-2); and
(d-4) deriving a subject name of the text data in consideration of the subject index of the step (d-3);
In step (f),
(f-1) measuring similarity between subject names of text data included in the initial classification system and subject names of text data included in the new classification system; and
(f-2) updating the extended classification system by merging the subject names of the text data included in the new classification system with, or separating them from, the subject names of the text data included in the initial classification system, according to the similarity measurement result of step (f-1);
including,
A method for constructing a subject-specific classification dictionary for text data.

According to claim 1,
In step (b),
(b-1) dividing a sentence included in the text data by word; and
(b-2) separating and extracting only the words corresponding to the noun group, verb group, and adjective group from the words divided in step (b-1);
including,
A method for constructing a subject-specific classification dictionary for text data.

According to claim 1,
In the step (c-2),
(c-2-1) extracting the subject index of each word included in the combination group; and
(c-2-2) assigning a weight to each word included in the combination group;
including,
A method for constructing a subject-specific classification dictionary for text data.

According to claim 3,
In the step (c-2-2),
When the combination group includes a noun and an adjective, additional weight is given to the noun of the combination group,
A method for constructing a subject-specific classification dictionary for text data.

According to claim 3,
In the step (c-2-2),
If the adjective included in the combination group is a pre-set and classified emphatic adjective, additional weight is given to the adjective,
A method for constructing a subject-specific classification dictionary for text data.

According to claim 1,
In the step (f-1),

Judging the similarity between the subject name of the text data included in the initial classification system and the subject name of the text data included in the new classification system through the formula of
A method for constructing a subject-specific classification dictionary for text data.

According to claim 6,
In the step (f-2),
As a result of the similarity determination in step (f-1), subject names of text data having an absolute value of 0.5 or more are merged and stored together, and subject names of text data having an absolute value of less than 0.5 are recognized as new classifications and stored separately, thereby updating the extended classification system,
A method for constructing a subject-specific classification dictionary for text data.

According to claim 7,
In the step (f-2),
As a result of the similarity determination in step (f-1), if there are multiple subject names of text data having an absolute value of 0.5 or more, only the subject name of the text data having the highest absolute value is merged and stored,
A method for constructing a subject-specific classification dictionary for text data.

A computer-readable storage medium storing a program for executing the subject-specific classification dictionary construction method for text data of any one of claims 1 to 8 in a computer.