KR102269737B1

KR102269737B1 - Information Classification Method Based on Deep-Learning And Apparatus Thereof

Info

Publication number: KR102269737B1
Application number: KR1020190110248A
Authority: KR
Inventors: 온병원
Original assignee: 군산대학교 산학협력단
Priority date: 2019-09-05
Filing date: 2019-09-05
Publication date: 2021-06-28
Also published as: KR102269737B9; KR20210029007A

Abstract

The following embodiments relate to a deep learning-based information classification method. The method according to an embodiment may include: collecting web pages related to an entity; extracting, for each of the web pages, features of the web page according to the importance, within the set of web pages, of at least one word included in the web page and the normalized frequency of web pages that include the word; determining, based on the features of the web pages, frequencies indicating the degree to which the web pages are related to the entity; classifying the web pages into relevance data and non-relevance data for the entity based on the frequencies of the web pages; and generating a training data set using the classification result and the features.

Description

Information Classification Method Based on Deep-Learning And Apparatus Thereof

The following embodiments relate to a deep learning-based information classification method and an apparatus for performing the same.

With the Fourth Industrial Revolution, artificial intelligence-based data mining techniques that extract valuable knowledge from big data are playing an important role in technological innovation and in improving corporate productivity.

All information about an entity, such as a movie or a hotel, exists on the web. To build knowledge about an entity without human intervention, numerous web pages must be collected and the relevance of each web page to the entity must be determined.

Entities pose several problems: many entities share similar names, an entity name may be ambiguous, an entity may appear in only a few web pages, and an entity may change over time. Moreover, in the real world, web pages are not simple static data but complex dynamic data that changes over time.

Traditional classification techniques such as support vector machines (SVM) and ensemble models require domain experts to hand-craft the distinctive features that distinguish entities. Deep learning techniques, by contrast, extract features within the model itself, so no feature engineering is needed, and they handle complex nonlinear problems well. However, they suffer from overfitting because of model complexity.

Methods have been proposed that address these problems by using artificial intelligence-based data mining techniques to build knowledge about entities.

An embodiment aims to provide a deep learning-based information classification method that automatically generates a training data set for building knowledge about an entity and learns from it, and an apparatus for performing the method.

A deep learning-based information classification method may be provided, the method including: collecting web pages related to an entity; extracting, for each of the web pages, features of the web page according to the importance, within the set of web pages, of at least one word included in the web page and the normalized frequency of web pages that include the word; determining, based on the features of the web pages, frequencies indicating the degree to which the web pages are related to the entity; classifying the web pages into relevance data and non-relevance data for the entity based on the frequencies of the web pages; and generating a training data set using the classification result and the features.

The collecting of the web pages may include: generating at least one combinable query using attributes of the entity; and collecting the web pages for the at least one query through a web search engine.

The classifying into relevance data and non-relevance data may include using the frequency distribution of the web pages to classify web pages whose frequency is at or above a threshold as the relevance data and web pages whose frequency is below the threshold as the non-relevance data.

The extracting of the features of the web page may include: calculating the relevance to the entity of the words included in the web page by substituting the words into the following equation; and extracting, as the features, the words with high calculated values from among those words according to a predetermined criterion.

Equation:

where score(w,t) is the relevance to the entity of word t in web page w; α is a parameter for taking the weighted average of the first and second terms; tf-idf(t) is a statistical value indicating how important word t, contained in web page w, is within the web page group W; freq(w) is the frequency of web page w containing word t; ∑freq(w') is the sum of the frequencies of the web pages w' in the document group W; the left term is the importance of word t within the document group W; and the right term is the normalized frequency of web page w containing word t.

The method may further include: training an information classification model based on the training data set; classifying the web pages based on the trained information classification model; and providing the information classification model and the training data set.

The classifying of the web pages based on the trained information classification model may include classifying the relevance of the web pages to the entity.

An apparatus for deep learning-based information classification may be provided, the apparatus including: one or more processors; a memory; and one or more programs stored in the memory and configured to be executed by the one or more processors, wherein the programs perform: collecting web pages related to an entity; extracting, for each of the web pages, features of the web page according to the importance, within the set of web pages, of at least one word included in the web page and the normalized frequency of web pages that include the word; determining, based on the features of the web pages, frequencies indicating the degree to which the web pages are related to the entity; classifying the web pages into relevance data and non-relevance data for the entity based on the frequencies of the web pages; and generating a training data set using the classification result and the features.

According to an embodiment, a deep learning-based information classification method that automatically generates a training data set for building knowledge about an entity and learns from it, and an apparatus for performing the method, can be provided.

FIG. 1 is a flowchart of a deep learning-based information classification method according to an embodiment.
FIG. 2 is an example of a method of generating all combinable queries according to an embodiment.
FIG. 3 is a conceptual diagram illustrating the division into relevance and non-relevance according to frequency, according to an embodiment.
FIG. 4 is an automatic training data generation algorithm illustrating a method of automatically generating a training data set according to an embodiment.
FIG. 5 illustrates a modeled information classification model according to an embodiment.
FIG. 6 is a block diagram of an apparatus for performing a deep learning-based information classification method according to an embodiment.

Hereinafter, embodiments of the present invention are described in detail with reference to the accompanying drawings.

The specific structural or functional descriptions of the embodiments according to the concept of the present invention disclosed herein are given only for the purpose of describing those embodiments. The embodiments according to the concept of the present invention may be implemented in various forms and are not limited to the embodiments described herein.

Because the embodiments according to the concept of the present invention may be modified in various ways and may take various forms, the embodiments are illustrated in the drawings and described in detail herein. However, this is not intended to limit the embodiments to the specific disclosed forms; they include all changes, equivalents, and substitutes that fall within the spirit and technical scope of the present invention.

Terms such as "first" and "second" may be used to describe various components, but the components should not be limited by these terms. The terms are used only to distinguish one component from another. For example, without departing from the scope of rights according to the concept of the present invention, a first component may be called a second component, and similarly a second component may be called a first component.

When a component is described as being "connected" or "coupled" to another component, it may be directly connected or coupled to that other component, but intervening components may also be present. By contrast, when a component is described as being "directly connected" or "directly coupled" to another component, it should be understood that no intervening component is present. Expressions describing the relationship between components, such as "between" and "directly between", or "adjacent to" and "directly adjacent to", should be interpreted in the same way.

The terminology used herein is used only to describe specific embodiments and is not intended to limit the present invention. A singular expression includes the plural unless the context clearly indicates otherwise. In this specification, terms such as "include" or "have" indicate that the stated features, numbers, steps, operations, components, parts, or combinations thereof exist, and do not preclude the presence or addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

Unless otherwise defined, all terms used herein, including technical and scientific terms, have the same meaning as commonly understood by a person of ordinary skill in the art to which this invention belongs. Terms defined in commonly used dictionaries should be interpreted as having a meaning consistent with their meaning in the context of the related art, and should not be interpreted in an idealized or overly formal sense unless expressly so defined herein.

FIG. 1 is a flowchart of a deep learning-based information classification method according to an embodiment. In an embodiment, the method may be carried out by an apparatus for performing the information classification method.

In step 110, the apparatus collects web pages related to an entity.

In an embodiment, the apparatus may receive, collect, or crawl web pages about the entity. The web pages may be collected using queries (search terms) generated from the attributes of the entity.

A web page may refer to anything that expresses a certain intention, idea, or thought in writing or symbols. For example, a web page may refer to any form of online content that contains information about an entity.

FIG. 2 illustrates an example of a method of generating all combinable queries from the attributes of an entity, according to an embodiment.

In an embodiment, when web pages are to be received, all combinable queries may be generated from the attributes of entity e. For example, assume that an entity has three attributes, as shown in Table 1. Depending on the embodiment, an entity may have more attributes.

The queries generated from these attributes may be expressed as shown in Table 2.

With the attributes and queries generated as in Tables 1 and 2, web pages are collected for each query through a web search engine, where the top k web pages most relevant to the query may be collected. For example, the web pages collected through all the queries (q1 to q7) may be regarded as the main web pages of entity e.

Accordingly, given the attributes of entity e, all possible combinations can be expressed as a set e = {...} (the attribute list and the set expression are not reproduced in this text).
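The query generation step can be sketched as follows. This is a minimal illustration, assuming that "all combinable queries" means every non-empty combination of attribute values, which is consistent with three attributes yielding the seven queries q1 to q7. The function name `generate_queries` and the attribute values are hypothetical and are not taken from Table 1 or Table 2.

```python
from itertools import combinations

def generate_queries(attributes):
    """Return every non-empty combination of attribute values as a query string."""
    queries = []
    for size in range(1, len(attributes) + 1):
        for combo in combinations(attributes, size):
            queries.append(" ".join(combo))
    return queries

# Hypothetical attribute values for a movie entity (placeholders only).
attrs = ["Movie Title", "Director Name", "2019"]
queries = generate_queries(attrs)
print(len(queries))  # 2**3 - 1 = 7 queries for three attributes
```

Each query string would then be submitted to a web search engine and the top k results kept. The search call itself is omitted here because the source does not name a specific engine or API.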

In step 120, the apparatus classifies the collected web pages based on frequency.

In an embodiment, the web pages may be classified into relevance data and non-relevance data based on the frequency distribution of the web pages.

A web page with a high frequency may be judged highly relevant to the entity. Conversely, a web page with a low frequency may be judged to have low relevance to the entity. For example, according to a predetermined criterion, the n web pages with the highest frequencies may be classified as relevance data and the n web pages with the lowest frequencies as non-relevance data.

FIG. 3 is a conceptual diagram illustrating a method of classifying web pages as relevance data or non-relevance data for an entity, according to an embodiment.

Referring to the graph of FIG. 3, when the documents for each entity are sorted in descending order of frequency, documents with high frequencies may be classified as highly relevant to the entity and documents with low frequencies as having low relevance to the entity.

In an embodiment, documents with high frequencies are assigned to the relevance data and documents with low frequencies to the non-relevance data.
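A minimal sketch of this frequency-based split is shown below. The source does not define exactly how a page's frequency is counted. The sketch assumes it is the number of query result lists in which the page appears, and it uses the top-n and bottom-n rule mentioned above. The function name `split_by_frequency` and the sample URLs are hypothetical.

```python
from collections import Counter

def split_by_frequency(results_per_query, n):
    """Label the n most frequent pages relevant and the n least frequent non-relevant."""
    # Assumed definition: frequency = number of query result lists containing the page.
    freq = Counter(url for urls in results_per_query.values() for url in urls)
    ranked = [url for url, _ in freq.most_common()]  # descending frequency
    if len(ranked) < 2 * n:
        raise ValueError("need at least 2*n distinct pages")
    return ranked[:n], ranked[-n:], freq

# Hypothetical search results: query -> list of distinct page identifiers.
results = {"q1": ["a", "b", "c"], "q2": ["a", "b", "d"], "q3": ["a", "e"]}
relevant, non_relevant, freq = split_by_frequency(results, n=1)
print(relevant, freq["a"])  # page "a" appears in all three result lists
```

A threshold on the frequency, rather than fixed top-n and bottom-n cut-offs, would be an equally valid reading of the source and only changes the two slices.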

In step 130, the apparatus extracts at least one feature of the entity from each of the classified data.

To extract features, the relevance to the entity is calculated for the relevance data and the non-relevance data respectively, and features are extracted based on the calculated results.

In an embodiment, the relevance to the entity may be calculated based on Equation 1.
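The body of Equation 1 does not survive in this text. Based on the term definitions that accompany it (a weighted average, controlled by α, of a tf-idf term on the left and a normalized page frequency on the right), one plausible reconstruction is the following. The placement of α and (1 − α) is an assumption, not a quotation from the patent.

```latex
\mathrm{score}(w, t) \;=\; \alpha \cdot \text{tf-idf}(t) \;+\; (1 - \alpha) \cdot \frac{\mathrm{freq}(w)}{\sum_{w' \in W} \mathrm{freq}(w')}
```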

Here, score(w,t) denotes the relevance to the entity of word t in web page w; α is a parameter for taking the weighted average of the first and second terms; tf-idf(t) is a statistical value indicating how important word t, contained in web page w, is within the document group W; freq(w) is the frequency, within the document group W, of web page w containing word t; and ∑freq(w') is the sum of the frequencies of the web pages w' in the document group W. The tf-idf() term marks word t as important for classifying web page w when t appears in w but does not occur frequently across W. The freq() term captures, through the frequency of web page w containing word t, the importance of frequently occurring documents. In Equation 1, the left term represents the importance of word t within the document group W, and the right term represents the normalized frequency of web page w containing word t.
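The scoring described above can be sketched in a few lines. This is an illustration under two assumptions: the score is taken to be α times tf-idf plus (1 − α) times the normalized page frequency, and tf-idf is taken to be the common variant (relative term frequency times the natural log of the inverse document frequency), since the source does not specify a variant. The word lists and frequency counts are hypothetical.

```python
import math

def tf_idf(term, page, corpus):
    """Relative term frequency in the page times log inverse document frequency."""
    tf = page.count(term) / len(page)
    df = sum(1 for doc in corpus if term in doc)
    return tf * math.log(len(corpus) / df) if df else 0.0

def score(term, page_id, pages, freq, alpha=0.5):
    """Assumed form: alpha * tf-idf + (1 - alpha) * normalized page frequency."""
    corpus = list(pages.values())
    normalized = freq[page_id] / sum(freq.values())
    return alpha * tf_idf(term, pages[page_id], corpus) + (1 - alpha) * normalized

# Hypothetical pages (as word lists) and page frequencies.
pages = {
    "w1": ["t1", "t2", "t3", "t4"],
    "w2": ["t1", "t3", "t4", "t5"],
    "w3": ["t2", "t3", "t4", "t5"],
    "w4": ["t3", "t4", "t5", "t6"],
}
freq = {"w1": 4, "w2": 3, "w3": 2, "w4": 1}
print(score("t1", "w1", pages, freq))
```

With α = 1 the score reduces to pure tf-idf, so a word such as t3 that appears in every page scores zero. With α = 0 every word in a page receives that page's normalized frequency.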

In an embodiment, a predetermined number of top-scoring words, ranked by the values calculated from the relevance data, may be extracted as words highly relevant to the entity and used as relevance features. Likewise, the top-scoring words from the non-relevance data may be extracted and used as non-relevance features. The relevance features and non-relevance features may be sorted in descending order of calculated value and combined into a single feature set.
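The feature selection just described might look like the sketch below. It assumes that a single score per word is already available for each class, because the source does not say how scores from several pages are combined. It also assumes that when a word is selected in both classes, the relevance score is kept. The names `top_k` and `build_features` and all numeric values are hypothetical.

```python
def top_k(scores, k):
    """Return the k highest-scoring (word, score) pairs."""
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]

def build_features(relevant_scores, non_relevant_scores, k):
    """Merge the top-k words of each class into one map, sorted by score descending."""
    merged = dict(top_k(non_relevant_scores, k))
    merged.update(top_k(relevant_scores, k))  # assumption: relevance score wins on overlap
    return dict(sorted(merged.items(), key=lambda kv: kv[1], reverse=True))

# Hypothetical per-word scores for each class.
relevant_scores = {"t1": 0.9, "t2": 0.8, "t3": 0.1, "t4": 0.1}
non_relevant_scores = {"t3": 0.1, "t4": 0.1, "t5": 0.7, "t6": 0.6}
features = build_features(relevant_scores, non_relevant_scores, k=2)
print(list(features))  # words ordered by descending score
```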

In step 140, the apparatus generates a training data set using the features.

In an embodiment, the training data set may be generated using the relevance features and non-relevance features extracted from the relevance data and the non-relevance data.

FIG. 4 is an example of an automatic training data generation algorithm illustrating a method of automatically generating a training data set, according to an embodiment.

In an embodiment, as described above, all combinable queries may be generated from the attributes (a_1 to a_n) of entity e. Through the generated queries, the top-ranked documents most relevant to entity e can be collected. Based on the frequency distribution of the collected documents, documents with high frequencies may be classified as relevance data and documents with low frequencies as non-relevance data.

For the words in the classified data, the relevance to the entity may be calculated based on Equation 1, and features may be extracted from the scored relevance and non-relevance data. Using the extracted features, a training data set that represents the relevance data and the non-relevance data can be generated.

The relevance data and non-relevance data, the features extracted from them, and the generated training data set may be expressed as follows.

Relevance data: w1={t1, t2, t3, t4}, w2={t1, t3, t4, t5}

Non-relevance data: w3={t2, t3, t4, t5}, w4={t3, t4, t5, t6}

Top 2 words by score in the relevance data (word, score): (t1, score(w1, t1)), (t2, score(w1, t2))

Top 2 words by score in the non-relevance data (word, score): (t5, score(w3, t5)), (t6, score(w4, t6))

Features in descending order of score: {t1=score(w1, t1), t2=score(w1, t2), t5=score(w3, t5), t6=score(w4, t6)}

Relevance data represented by the extracted features: w1=<relevance, t1=score(w1, t1), t2=score(w1, t2), t3=0, t4=0>, w2=<relevance, t1=score(w1, t1), t3=0, t4=0, t5=score(w3, t5)>

Non-relevance data represented by the extracted features: w3=<non-relevance, t2=score(w1, t2), t3=0, t4=0, t5=score(w3, t5)>, w4=<non-relevance, t3=0, t4=0, t5=score(w3, t5), t6=score(w4, t6)>

Training data set: w1=<relevance, t1=score(w1, t1), t2=score(w1, t2), t3=0, t4=0>, w2=<relevance, t1=score(w1, t1), t3=0, t4=0, t5=score(w3, t5)>, w3=<non-relevance, t2=score(w1, t2), t3=0, t4=0, t5=score(w3, t5)>, w4=<non-relevance, t3=0, t4=0, t5=score(w3, t5), t6=score(w4, t6)>

Deep learning-based information classifier, trained on the training data set, in which knowledge of the entity is built: D

Test web page: w5=<t5, t6, t7, t8>

Test web page represented by the extracted features: w5=<undetermined, t5=score(w3, t5), t6=score(w4, t6), t7=0, t8=0>

Result of information classifier D for the test web page: non-relevance

Here, w1, w2, w3, w4, and w5 denote web pages; t1, t2, t3, t4, t5, t6, t7, and t8 denote words; score(w, t) denotes the relevance to the entity of word t in web page w according to Equation 1; and D may denote a deep learning model.
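The worked example above can be reproduced with a short sketch. It follows the representation shown in the example: each word of a page takes the score of the matching extracted feature, and words that are not features take 0. The numeric scores stand in for score(w1, t1), score(w1, t2), score(w3, t5), and score(w4, t6) and are hypothetical, as is the function name `represent`.

```python
def represent(words, label, features):
    """Map each word of a page to its feature score, or 0.0 if it is not a feature."""
    return label, {w: features.get(w, 0.0) for w in words}

# Hypothetical stand-ins for the four extracted feature scores.
features = {"t1": 0.9, "t2": 0.8, "t5": 0.7, "t6": 0.6}

labeled_pages = {
    "w1": (["t1", "t2", "t3", "t4"], "relevance"),
    "w2": (["t1", "t3", "t4", "t5"], "relevance"),
    "w3": (["t2", "t3", "t4", "t5"], "non-relevance"),
    "w4": (["t3", "t4", "t5", "t6"], "non-relevance"),
}
training_set = {
    pid: represent(words, label, features)
    for pid, (words, label) in labeled_pages.items()
}

# The test page w5 has no label yet; the trained classifier D would assign one.
test_page = represent(["t5", "t6", "t7", "t8"], None, features)
print(training_set["w1"])
print(test_page)
```

In this toy data the test page shares its only non-zero features (t5 and t6) with the non-relevance pages, which matches the example's outcome of D returning non-relevance.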

In step 150, the apparatus trains an information classification model based on the training data set.

In an embodiment, a deep learning-based information classification model may be trained on the generated training data set. For example, a deep learning model such as a feed-forward neural network (FNN), a convolutional neural network (CNN), or a recurrent neural network (RNN) is trained on the training data set. As a result of training, a model in which knowledge of the entity is built, and which classifies the entity, may be obtained.

FIG. 5 illustrates a modeled information classification model according to an embodiment.

According to the embodiment of FIG. 5, a deep learning model including LSTM cells may be trained based on training data sets. x_1 to x_{n-1} may be the n words included in the features extracted from a web page.

The output values h_0 to h_{n-1} of the LSTM layer may be summed, input to a softmax layer, and output as normalized values. For example, the output of the softmax layer may include the probability that the web page is related to the entity and the probability that it is not.
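The output stage described above (summing the LSTM outputs and normalizing with softmax) can be sketched without a deep learning framework. The LSTM itself is not implemented here, and its outputs are given as plain lists. The linear projection from the summed vector to two logits is an assumption about what the softmax layer contains, and the weights, bias, and hidden values are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classify(hidden_outputs, weights, bias):
    """Sum the LSTM outputs element-wise, project to two logits, and normalize."""
    dim = len(hidden_outputs[0])
    summed = [sum(h[i] for h in hidden_outputs) for i in range(dim)]
    logits = [sum(w * s for w, s in zip(row, summed)) + b
              for row, b in zip(weights, bias)]
    return softmax(logits)

# Hypothetical LSTM outputs for three time steps with hidden size 2.
hidden = [[0.2, -0.1], [0.4, 0.3], [0.1, 0.0]]
weights = [[1.0, 0.0], [0.0, 1.0]]  # identity projection, for illustration only
bias = [0.0, 0.0]
probs = classify(hidden, weights, bias)
print(probs)  # [P(related to entity), P(not related to entity)]
```

In a real model the LSTM parameters and the projection would be learned jointly from the training data set.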

In step 160, the apparatus classifies web pages based on the trained information classification model.

In an embodiment, the relevance of web pages to the entity may be classified based on the trained model. For example, a web page can be represented by the extracted features and evaluated with the trained information classification model.

In an embodiment, the information classification model trained to classify relevance to the entity, together with the data on which training for the entity has been completed, may be provided.

FIG. 6 is a block diagram of an apparatus for performing a deep learning-based information classification method according to an embodiment.

The apparatus 600 according to an embodiment may include a memory 610 and a processor 620, and the deep learning-based information classification method stored in the memory 610 may be executed by the processor 620. The deep learning-based information classification method according to the embodiment is based on the method described with reference to FIGS. 1 to 5.

The apparatus 600 may receive (or obtain, collect, or crawl) web pages about an entity and process them, thereby providing web pages that are highly relevant to the entity.

장치(600)는 PC(personal computer), 데이터 서버, 또는 휴대용 장치로 구현될 수 있다. 예를 들어, 휴대용 장치는 랩탑(laptop) 컴퓨터, 이동 전화기, 스마트 폰(smart phone), 태블릿(tablet) PC, 모바일 인터넷 디바이스(mobile internet device(MID)), PDA(personal digital assistant), EDA(enterprise digital assistant), 디지털 스틸 카메라(digital still camera), 디지털 비디오 카메라(digital video camera), PMP(portable multimedia player), PND(personal navigation device 또는 portable navigation device), 휴대용 게임 콘솔(handheld game console), e-북(e-book) 또는 스마트 디바이스(smart device)로 구현될 수 있다. 스마트 디바이스는 스마트 와치(smart watch), 스마트 밴드(smart band), 또는 스마트 링(smart ring)으로 구현될 수 있다.The device 600 may be implemented as a personal computer (PC), a data server, or a portable device. For example, a portable device may be a laptop computer, a mobile phone, a smart phone, a tablet PC, a mobile internet device (MID), a personal digital assistant (PDA), an EDA ( enterprise digital assistant), digital still camera, digital video camera, PMP (portable multimedia player), PND (personal navigation device or portable navigation device), handheld game console, It may be implemented as an e-book or a smart device. The smart device may be implemented as a smart watch, a smart band, or a smart ring.

The apparatus 600 collects web pages related to an entity.

The apparatus 600 classifies the collected web pages based on their frequencies.

The apparatus 600 extracts at least one feature of the entity from each piece of the classified data.

In an embodiment, the relevance to the entity may be calculated based on Equation 1 above.

The apparatus 600 generates a training data set using the features.

The apparatus 600 trains an information classification model based on the training data set.

The apparatus 600 classifies web pages based on the trained information classification model.
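
The sequence of operations above can be sketched in Python. This is a minimal illustration under stated assumptions, not the patented implementation: the `Page` record, the function names, the precomputed tf-idf values, the default `alpha`, and the use of the mean frequency as the cutoff are all hypothetical, and the score is assumed to be the weighted average of word importance and normalized page frequency that claim 4 describes. Model training is left out.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List, Optional, Tuple


@dataclass
class Page:
    """Hypothetical record for one collected web page."""
    url: str
    tfidf: Dict[str, float]  # word -> tf-idf value, assumed to be precomputed
    freq: int                # frequency assigned to the page


def score(page: Page, word: str, total_freq: int, alpha: float = 0.5) -> float:
    """Assumed score: weighted average of the word's importance and the
    page's frequency normalized by the sum of all page frequencies."""
    normalized = page.freq / total_freq if total_freq else 0.0
    return alpha * page.tfidf[word] + (1 - alpha) * normalized


def top_features(page: Page, total_freq: int, k: int = 2,
                 alpha: float = 0.5) -> List[Tuple[str, float]]:
    """Return the k highest-scoring (word, score) pairs of a page."""
    scored = [(w, score(page, w, total_freq, alpha)) for w in page.tfidf]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


def build_training_set(pages: List[Page], cutoff: Optional[float] = None,
                       k: int = 2, alpha: float = 0.5):
    """Label each page by its frequency and attach its top-k features.

    Label 1 marks relevance data and 0 marks non-relevance data.
    Using the mean frequency as the default cutoff is an assumption.
    """
    total = sum(p.freq for p in pages)
    if cutoff is None:
        cutoff = mean(p.freq for p in pages)
    rows = []
    for p in pages:
        label = 1 if p.freq >= cutoff else 0
        rows.append((p.url, top_features(p, total, k, alpha), label))
    return rows
```

Each returned row pairs a page with its selected features and a relevance label, which is the shape a downstream classifier would be trained on. Note that within a single page the normalized-frequency term is the same for every word, so under this assumed score the ranking of words inside a page follows their tf-idf values; the frequency term only changes how scores compare across pages.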

The apparatus described above may be implemented as hardware components, software components, and/or a combination of hardware and software components. For example, the apparatuses and components described in the embodiments may be implemented using one or more general-purpose or special-purpose computers, such as a processor, a controller, an arithmetic logic unit (ALU), a digital signal processor, a microcomputer, a field programmable gate array (FPGA), a programmable logic unit (PLU), a microprocessor, or any other device capable of executing and responding to instructions. The processing device may run an operating system (OS) and one or more software applications that run on the operating system. The processing device may also access, store, manipulate, process, and generate data in response to execution of the software. For ease of understanding, the description sometimes refers to a single processing device, but a person of ordinary skill in the art will appreciate that the processing device may include a plurality of processing elements and/or multiple types of processing elements. For example, the processing device may include a plurality of processors, or one processor and one controller. Other processing configurations, such as parallel processors, are also possible.

Software may include a computer program, code, instructions, or a combination of one or more of these, and may configure the processing device to operate as desired or may instruct the processing device independently or collectively. Software and/or data may be embodied, permanently or temporarily, in any type of machine, component, physical device, virtual equipment, computer storage medium, or device in order to be interpreted by the processing device or to provide instructions or data to the processing device. The software may be distributed over networked computer systems and stored or executed in a distributed manner. Software and data may be stored in one or more computer-readable recording media.

The method according to the embodiment may be implemented in the form of program instructions that can be executed through various computer means and recorded in a computer-readable medium. The computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the medium may be specially designed and configured for the embodiment, or may be known and available to those skilled in the art of computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tapes; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include not only machine code such as that produced by a compiler, but also high-level language code that can be executed by a computer using an interpreter or the like. The hardware devices described above may be configured to operate as one or more software modules in order to perform the operations of the embodiments, and vice versa.

Although the embodiments have been described above with reference to a limited set of embodiments and drawings, a person of ordinary skill in the art can make various modifications and variations based on the above description. For example, an appropriate result may be achieved even if the described techniques are performed in an order different from the described method, and/or components of the described system, structure, apparatus, circuit, and the like are coupled or combined in a form different from the described method, or are replaced or substituted by other components or equivalents.

Therefore, other implementations, other embodiments, and equivalents to the claims also fall within the scope of the following claims.

Claims

1. A deep learning-based information classification method comprising:
collecting web pages related to an entity;
extracting, for each of the web pages, features of the web page using the importance, within the set of web pages, of at least one word included in the web page and the normalized frequency of the web pages that include that word, the features including relevance features and non-relevance features;
determining frequencies of the web pages based on the features of the web pages;
classifying the web pages into relevance data and non-relevance data for the entity based on the frequencies of the web pages; and
generating a training data set using the classification result and the features.

2. The method of claim 1, wherein collecting the web pages comprises:
generating at least one combinable query using attributes of the entity; and
collecting the web pages for the at least one query through a web search engine.

3. The method of claim 1, wherein classifying the web pages into relevance data and non-relevance data for the entity comprises:
classifying, using a frequency distribution of the web pages, web pages having a frequency greater than or equal to a reference value as the relevance data and web pages having a frequency lower than the reference value as the non-relevance data.

4. The method of claim 1, wherein extracting the features of the web page comprises:
calculating relevance to the entity by substituting the words included in the web page into the following equation; and
extracting, as the features, words having a high calculated relevance according to a predetermined criterion.
Formula:

In the equation, score(w,t) is the relevance to the entity of the word t in the web page w; α is the parameter for calculating the weighted average of the first and second terms; tf-idf(t) is a statistic indicating how important the word t contained in the web page w is within the web page group W; freq(w) is the frequency of the web page w containing the word t; and ∑freq(w') is the sum of the frequencies of the web pages w' contained in the document group W. The left term is the importance of the word t in the document group W, and the right term is the normalized frequency of the web page w containing the word t.
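
The equation itself is not reproduced in this text. One form consistent with the term definitions above, offered as an assumption rather than as the claimed equation, is a weighted average that puts α on the left (word importance) term and 1 − α on the right (normalized frequency) term; restricting α to the range 0 to 1 is likewise an assumption:

```latex
\mathrm{score}(w,t) = \alpha \cdot \text{tf-idf}(t) + (1-\alpha) \cdot \frac{\mathrm{freq}(w)}{\sum_{w' \in W} \mathrm{freq}(w')}
```

Under this reading, α = 1 ranks words purely by their importance in the document group W, while α = 0 scores every word of a page by that page's share of the total frequency.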

5. The method of claim 1, further comprising:
training an information classification model based on the training data set;
classifying the web pages based on the trained information classification model; and
providing the information classification model and the training data set.

6. The method of claim 5, wherein classifying the web pages based on the trained information classification model comprises:
classifying the web pages by their relevance to the entity.

7. A computer program stored in a medium that, in combination with hardware, executes the method of any one of claims 1 to 6.

8. An apparatus for deep learning-based information classification, comprising:
one or more processors;
a memory; and
one or more programs stored in the memory and configured to be executed by the one or more processors,
wherein the one or more programs perform:
collecting web pages related to an entity;
extracting, for each of the web pages, features of the web page using the importance, within the set of web pages, of at least one word included in the web page and the normalized frequency of the web pages that include that word, the features including relevance features and non-relevance features;
determining frequencies of the web pages based on the features of the web pages;
classifying the web pages into relevance data and non-relevance data for the entity based on the frequencies of the web pages; and
generating a training data set using the classification result and the features.

9. The apparatus of claim 8, wherein collecting the web pages comprises:
generating at least one combinable query using attributes of the entity; and
collecting the web pages for the at least one query through a web search engine.

10. The apparatus of claim 8, wherein classifying the web pages into relevance data and non-relevance data for the entity comprises:
classifying, using a frequency distribution of the web pages, web pages having a frequency greater than or equal to a reference value as the relevance data and web pages having a frequency lower than the reference value as the non-relevance data.

11. The apparatus of claim 8, wherein extracting the features of the web page comprises:
calculating relevance to the entity by substituting the words included in the web page into the following equation; and
extracting, as the features, words having a high calculated relevance according to a predetermined criterion.
Formula:

12. The apparatus of claim 8, wherein the one or more programs further perform:
training an information classification model based on the training data set;
classifying the web pages based on the trained information classification model; and
providing the information classification model and the training data set.

13. The apparatus of claim 12, wherein classifying the web pages based on the trained information classification model comprises:
classifying the web pages by their relevance to the entity.