KR102033416B1

KR102033416B1 - Method for generating data extracted from document and apparatus thereof

Info

Publication number: KR102033416B1
Application number: KR1020170155836A
Authority: KR
Inventors: 김기수; 이종수; 김환국; 김태은; 장대일; 유창훈; 손영남; 고은혜; 나사랑
Original assignee: 주식회사 루테스
Priority date: 2017-11-21
Filing date: 2017-11-21
Publication date: 2019-10-17
Also published as: KR20190058141A

Abstract

Provided are a method and an apparatus for extracting required data from an externally collected document and generating structured data. A data generation method according to an embodiment of the present invention includes: obtaining a document from a document providing source; obtaining, from a parameter database, a data extraction parameter corresponding to the document providing source; constructing, based on the data extraction parameter, a data generation rule that defines a rule for extracting data from the document and a format for storing the extracted data; and extracting structured data from the document according to the data generation rule.

Description

METHOD FOR GENERATING DATA EXTRACTED FROM DOCUMENT AND APPARATUS THEREOF

The present invention relates to a method and an apparatus for generating structured data from a document. More specifically, it relates to a method of extracting required data from externally collected documents and generating structured data.

With advances in communication and computing technology, a vast number of documents are available over the Internet. Techniques such as crawling and scraping are used to obtain required data from documents provided over a communication network. These techniques obtain a web document and extract data from the obtained document. For example, the search engine of a portal service can provide a search service by collecting and indexing information on the web. The component that collects web information and extracts data is referred to as a spider, a bot, or an intelligent agent.

There is a need to extract desired data from the many documents collected over the Internet and to organize that data into a database. For example, security vulnerability information on software products, network devices, and the like, which is published on various websites, needs to be collected and organized into a database. To this end, a computing device collects documents by crawling, and an analysis device having a document analysis module (for example, a parser) extracts and analyzes the required data from the documents. The document analysis module extracts and analyzes data according to how the document presents its data. For example, the module may be configured to extract the data located after a hyphen (-).

However, when data is extracted with an analysis module that depends on the document's data presentation method, the module must be redeveloped whenever that presentation method changes, whether because documents are received from a different source or because a source changes the format of the documents it provides. In addition, the computing device that collects and analyzes documents must be shut down in order to apply the redeveloped analysis module.

The technical problem to be solved by the present invention is to provide a method and an apparatus that can minimize redevelopment of the module that analyzes documents collected from various sources, even when the form or presentation structure of the collected documents changes.

The technical problems of the present invention are not limited to those mentioned above, and other technical problems not mentioned will be clearly understood by those of ordinary skill in the art from the description below.

To solve the above technical problem, a data generation method according to an embodiment of the present invention may include: obtaining a document from a document providing source; obtaining, from a parameter database, a data extraction parameter corresponding to the document providing source; constructing, based on the data extraction parameter, a data generation rule that defines a rule for extracting data from the document and a format for storing the extracted data; and extracting structured data from the document according to the data generation rule.

According to another embodiment, the rule for extracting data may include a selector that indicates the location of data within the document.

According to yet another embodiment, the extracting may include selecting intermediate data from the document based on the selector, classifying the intermediate data, and generating the structured data based on the classification of the intermediate data.

According to yet another embodiment, the generating of the structured data may classify the intermediate data using at least one of a regular expression, a machine learning based classification model, and a text extraction algorithm.

According to yet another embodiment, the obtaining of the document may obtain a web document based on a Uniform Resource Locator (URL) of the document providing source, the parameter database may store the extraction parameter matched with the URL, and the obtaining of the extraction parameter may include retrieving the extraction parameter based on the URL.

According to yet another embodiment, the data generation method may further include outputting a rule list that includes the data generation rule and receiving an input that selects the data generation rule from the list, and the extracting of the data may extract the structured data based on the selected data generation rule.

According to yet another embodiment, the obtaining of the document may crawl a web document from a preset URL.

According to yet another embodiment, the obtaining of the document may obtain a web document that provides vulnerability information, and the vulnerability information may include information on the vulnerability type, the name of the affected product, the vulnerability identifier, and the date on which the vulnerability occurred.

To solve the above technical problem, a computing device according to an embodiment of the present invention may include: a crawler that obtains a document from an external document providing source; a parameter database that stores a data extraction parameter, which is a setting value for a data generation rule defining a rule for extracting data from the document and a format for storing the extracted data; and a processor configured to select the data extraction parameter based on the document providing source, construct the data generation rule based on the data extraction parameter, and extract structured data from the document according to the data generation rule.

To solve the above technical problem, a computer program recorded on a non-transitory computer-readable medium according to an embodiment of the present invention may, when its instructions are executed by a processor of a server, cause operations to be performed that include: obtaining a document from a document providing source; obtaining, from a parameter database, a data extraction parameter corresponding to the document providing source; constructing, based on the data extraction parameter, a data generation rule that defines a rule for extracting data from the document and a format for storing the extracted data; and extracting structured data from the document according to the data generation rule.

FIGS. 1 and 2 illustrate examples of documents provided by different document providing sources.
FIG. 3 is a diagram for describing a structure that generates structured data from documents in which data is represented in different ways.
FIG. 4 illustrates an example of source code for web page analysis.
FIG. 5 is a diagram for describing a structure in which a computing device generates structured data according to an embodiment.
FIG. 6 is a flowchart of a process in which a computing device generates structured data from a document according to an embodiment.
FIG. 7 is a flowchart of a process in which a computing device extracts structured data according to an embodiment.
FIG. 8 is a flowchart of a process in which a computing device selects a data generation rule and extracts structured data according to another embodiment.
FIG. 9 is a conceptual diagram illustrating an example of a rule list and data generation rules according to an embodiment.

Hereinafter, preferred embodiments of the present invention are described in detail with reference to the accompanying drawings. The advantages and features of the present invention, and the methods of achieving them, will become apparent from the embodiments described in detail below together with the accompanying drawings. However, the present invention is not limited to the embodiments disclosed below and may be implemented in various different forms. The embodiments are provided only so that the disclosure of the present invention is complete and so that the scope of the invention is fully conveyed to those of ordinary skill in the art to which the invention pertains, and the present invention is defined only by the scope of the claims. Like reference numerals refer to like elements throughout the specification.

Unless otherwise defined, all terms used in this specification (including technical and scientific terms) have the meaning commonly understood by those of ordinary skill in the art to which the present invention pertains. Terms defined in commonly used dictionaries are not to be interpreted in an idealized or excessive sense unless they are expressly and specifically defined. The terminology used in this specification is for describing the embodiments and is not intended to limit the present invention. In this specification, the singular form also includes the plural unless the context specifically states otherwise.

Hereinafter, some embodiments of the present invention are described with reference to the drawings.

FIGS. 1 and 2 illustrate examples of documents provided by different document providing sources.

FIG. 1 is an example of a web page containing Common Vulnerabilities and Exposures (CVE) information, which is the vulnerability information provided by the National Vulnerability Database (NVD). The CVE information includes all or some of a CVE-ID (110), vulnerability overview information (Overview) (120), CVSS (130), CPE (140), CWE (150), and Reference (160). The vulnerability overview information (120) may consist of "place where a vulnerability was discovered", "(in) related software product names", "(when) conditions of the vulnerability occurrence", "(allow) attacker type", "(to) results of attack", "(via) means of attack", "(aka) vulnerability title in the reference site", "(a different vulnerability than) other CVE-IDs", and the like.

FIG. 2 is an example of vulnerability information provided through a different web page (200). In the example shown in FIG. 2, the vulnerability information (210) includes a vulnerability identifier (Bugtraq ID; B-ID), a vulnerability type (class), a CVE-ID (CVE), remote/local information (Remote, Local), a publication date, and the names of potentially vulnerable products (Vulnerable). The web page (200) may further include, as other vulnerability information, a title (260), a discussion (220), exploit information (230), a solution (240), references (250), and the like.

Comparing FIGS. 1 and 2, although both are web pages that provide vulnerability information, the composition of the information and the locations at which it is displayed differ. Therefore, when a parser that extracts the information displayed after "CVE-ID" on the web page of FIG. 1 as the vulnerability identifier is applied to the web page (200) of FIG. 2, the parser fails to extract the vulnerability identifier.
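The failure described above can be illustrated with a minimal sketch. The page texts and the function name below are hypothetical stand-ins chosen for illustration; they are not taken from FIG. 1, FIG. 2, or the patent's own source code.

```python
def parse_id_after_label(page_text, label="CVE-ID:"):
    """Return the text that follows `label` on the first line starting with it.

    This is a layout-dependent parser: it only works when the page
    presents the identifier directly after the expected label.
    """
    for line in page_text.splitlines():
        if line.startswith(label):
            return line[len(label):].strip()
    return None  # label not found on this page layout


# Hypothetical plain-text renderings of two differently structured pages.
page_fig1_style = "CVE-ID: CVE-2017-0001\nOverview: sample text"
page_fig2_style = "Bugtraq ID: 12345\nClass: Design Error\nCVE: CVE-2017-0001"

id_from_first = parse_id_after_label(page_fig1_style)    # label present
id_from_second = parse_id_after_label(page_fig2_style)   # label absent
```

The same identifier is present in both pages, but the parser written for the first layout finds nothing in the second, which is the situation that motivates the parameter-driven approach.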

Accordingly, structured data may be generated through a structure such as the one shown in FIG. 3. FIG. 3 is a diagram for describing a structure that generates structured data from documents in which data is represented in different ways.

The crawler (310) may collect a first document (21), a second document (22), and a third document (23) from a first document providing source (11), a second document providing source (12), and a third document providing source (13), respectively.

The computing device may generate first structured data (31) by loading the obtained first document (21) into a first parser (321). Likewise, it may generate second structured data (32) by loading the obtained second document (22) into a second parser (322), and third structured data (33) by loading the third document (23) into a third parser (323). In this case, a separate parser must be developed for the data representation method of each document. Moreover, when the data representation method of a document changes, the corresponding parser must be developed again.

FIG. 4 illustrates an example of source code (400) for web page analysis. The source code (400) shown in FIG. 4 is written so that the computing device extracts, through doc.select(), the text located at a fixed position in the document. However, when the position of the element from which a value is extracted changes, the source code (400) must be modified. The modified source code must then be recompiled and applied to the computing device before the computing device can extract the desired data from the document whose element position has changed. In addition, the difficulty of modifying the source code (400) increases greatly as the number of data items to extract grows or the structure becomes more complex.

FIG. 5 is a diagram for describing a structure in which a computing device (500) generates structured data according to an embodiment. FIG. 5 merely illustrates one embodiment, and the structure of the computing device (500) and the number of its components may vary depending on the embodiment.

The computing device (500) according to an embodiment may include a crawler (510), a processor (520), and a parameter database (530). The crawler (510) may obtain documents (21, 22, 23) from document providing sources (11, 12, 13), respectively. According to an embodiment, the crawler (510) may obtain the documents (21, 22, 23) by accessing preset Uniform Resource Locators (URLs).

The parameter database (530) may store data extraction parameters. A data extraction parameter is a setting value for a data generation rule, which defines a rule for extracting data from a document and a format for storing the extracted data. The data extraction parameter may be stored in association with information about the document providing source (for example, a URL). That is, the rule for extracting data, the format for storing the extracted data, and the information about the document providing source may be stored in the parameter database (530) as one data set.
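As a rough illustration of such a data set, the following sketch keeps the three parts together in one record and indexes records by source URL. The class names, field names, and example URL are assumptions made for this sketch; the patent does not specify a schema or storage technology.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class ExtractionParameter:
    """One data set: source information, extraction rule, storage format."""
    source_url: str                 # information about the document providing source
    selectors: Dict[str, str]       # extraction rule: field name -> location path
    output_fields: Tuple[str, ...]  # storage format: fields of the structured record


class ParameterDatabase:
    """In-memory stand-in for the parameter database (530)."""

    def __init__(self) -> None:
        self._by_url: Dict[str, ExtractionParameter] = {}

    def put(self, param: ExtractionParameter) -> None:
        # Storing under the source URL is what "matching" means in this sketch.
        self._by_url[param.source_url] = param

    def lookup(self, source_url: str) -> Optional[ExtractionParameter]:
        return self._by_url.get(source_url)


db = ParameterDatabase()
db.put(ExtractionParameter(
    source_url="https://source-a.example/vuln",        # hypothetical source
    selectors={"vuln_id": ".//span[@class='cve-id']"},  # hypothetical location path
    output_fields=("vuln_id",),
))
```

A real deployment would presumably use a persistent store; a dictionary is used here only to keep the example self-contained.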

The processor (520) may select a data extraction parameter based on the document providing source. For example, to extract data from the first document (21), the processor (520) may look up the URL of the first document providing source (11) in the parameter database (530) and select the data extraction parameter that forms one data set with that URL.

The processor (520) may construct a data generation rule based on the selected data extraction parameter. For example, based on the data extraction parameter, the processor (520) may construct a data generation rule that defines which position in the first document (21) data is extracted from and which item of the first structured data (31) the extracted data is stored in. The processor (520) may then extract data from the first document (21) according to the constructed data generation rule and generate the first structured data (31), structured as the rule defines.

As shown in FIG. 5, the parameters are stored separately in the parameter database, and the processor (520) generates structured data not with a parser written individually for each document but by using the data generation rule to construct, by itself, a parser suited to each document. Furthermore, even when the data representation method of the documents provided by a document providing source changes, there is no need to shut down the computing device (500) and recompile. By modifying only the data extraction parameter stored in the parameter database with a parameter change program, the computing device (500) can extract data according to the changed data representation method.
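The effect of editing parameters instead of code can be sketched as follows. This is only an illustration under stated assumptions: the parameter store is a plain dictionary, the pages are well-formed markup parsed with Python's `xml.etree.ElementTree`, and its limited XPath paths stand in for the selectors. The URL and page contents are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical parameter store: source URL -> {field name: ElementTree path}.
params = {"https://source-a.example": {"vuln_id": ".//span[@class='id']"}}


def extract(url, markup):
    """Build the rule from the current parameters on every call, then apply it."""
    root = ET.fromstring(markup)
    rule = params[url]  # read at call time, so parameter edits take effect immediately
    record = {}
    for field, path in rule.items():
        node = root.find(path)
        record[field] = node.text.strip() if node is not None and node.text else None
    return record


old_page = "<html><body><span class='id'>CVE-2017-0001</span></body></html>"
new_page = "<html><body><div id='vid'>CVE-2017-0001</div></body></html>"

before = extract("https://source-a.example", old_page)
broken = extract("https://source-a.example", new_page)  # layout changed, old path misses

# Stand-in for the parameter change program: edit stored data, not program code.
params["https://source-a.example"]["vuln_id"] = ".//div[@id='vid']"
after = extract("https://source-a.example", new_page)
```

Because `extract` reads the parameters on each call, the running process picks up the new path without being stopped or rebuilt, which is the property the paragraph above describes.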

FIG. 6 is a flowchart of a process in which a computing device generates structured data from a document according to an embodiment.

First, the computing device may obtain a document from a document providing source (S610). The document providing source may be a device external to the computing device. For example, the computing device may obtain the document by crawling a preset URL or by downloading a web document file.

Next, the computing device may obtain, from the parameter database, the data extraction parameter corresponding to the document providing source from which the document was obtained (S620). The parameter database may store each data extraction parameter matched with the URL of its document providing source. Accordingly, the computing device may use the URL of the document providing source from which the document was obtained to look up and obtain the matching data extraction parameter from the parameter database.

Next, the computing device may use the obtained parameter to construct a data generation rule that defines the settings for extracting data and storing the extracted data in a predetermined format (S630). The computing device may then extract structured data by extracting data according to the constructed data generation rule and storing the extracted data in preset items (S640).
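Steps S610 to S640 can be sketched end to end as follows. This is a minimal illustration rather than the patented implementation: crawling is replaced by a dictionary of hypothetical pages, the parameter database by a dictionary keyed on URL prefix, and `xml.etree.ElementTree` paths stand in for the extraction rules. All names and URLs are assumptions.

```python
import xml.etree.ElementTree as ET

# Stand-in for crawled pages; a real crawler would fetch these URLs.
PAGES = {
    "https://source-a.example/1": (
        "<html><body><h1 class='id'>CVE-2017-0001</h1>"
        "<p class='prod'>ExampleOS</p></body></html>"
    ),
    "https://source-b.example/1": (
        "<html><body><table><tr><td id='cve'>CVE-2017-0002</td>"
        "<td id='vuln'>ExampleRouter</td></tr></table></body></html>"
    ),
}

# Stand-in for the parameter database, keyed by source URL prefix.
PARAMETERS = {
    "https://source-a.example/": {"vuln_id": ".//h1[@class='id']",
                                  "product": ".//p[@class='prod']"},
    "https://source-b.example/": {"vuln_id": ".//td[@id='cve']",
                                  "product": ".//td[@id='vuln']"},
}


def obtain_document(url):                       # S610
    return PAGES[url]


def obtain_parameters(url):                     # S620
    for prefix, parameters in PARAMETERS.items():
        if url.startswith(prefix):
            return parameters
    raise KeyError(f"no extraction parameters for {url}")


def build_rule(parameters):                     # S630
    def rule(markup):
        root = ET.fromstring(markup)
        # Field names define the storage format; paths define the extraction rule.
        return {field: (root.findtext(path) or "").strip()
                for field, path in parameters.items()}
    return rule


def generate_structured_data(url):              # S640
    return build_rule(obtain_parameters(url))(obtain_document(url))


records = [generate_structured_data(url) for url in PAGES]
```

Both pages use different markup, yet the resulting records share the same fields, because the differences are absorbed by the per-source parameters rather than by separate parsers.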

FIG. 7 is a flowchart of a process in which a computing device extracts structured data according to an embodiment.

According to an embodiment, the rule for extracting data may include a selector (for example, a CSS selector) that indicates the location of the data to be selected within the document. The computing device may first select intermediate data using the selector (S710).

Next, the computing device may select the data to be included in the structured data by classifying the intermediate data (S720). In step S720, the computing device may classify the intermediate data by applying to it one or more of a regular expression, a classification model generated through machine learning, and a text classification algorithm.

Next, the computing device may generate the structured data by storing the intermediate data in the items defined in the data generation rule according to the classification result (S730). By first selecting intermediate data with the selector and then classifying only that intermediate data, as shown in FIG. 7, the computing device can reduce the load that would arise from classifying all the data contained in the entire document.
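The two-stage flow of S710 to S730 can be sketched with regular expressions as the classifier. This is an illustrative assumption-laden example: an `xml.etree.ElementTree` path is swapped in for the CSS selector mentioned in the text, and the page, patterns, and field names are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

PAGE = """<html><body>
<div class='nav'>Home | About</div>
<ul class='info'>
  <li>CVE-2017-0001</li>
  <li>2017-11-21</li>
  <li>Buffer overflow</li>
</ul>
</body></html>"""

# Location path standing in for the selector in the extraction rule.
SELECTOR = ".//ul[@class='info']/li"

# Hypothetical classifiers: field name -> pattern the whole item must match.
CLASSIFIERS = [
    ("vuln_id", re.compile(r"CVE-\d{4}-\d{4,}")),
    ("published", re.compile(r"\d{4}-\d{2}-\d{2}")),
]
DEFAULT_FIELD = "vuln_type"  # assumed fallback for items no pattern matches


def select_intermediate(markup, selector):      # S710
    root = ET.fromstring(markup)
    return [(el.text or "").strip() for el in root.findall(selector)]


def classify(item):                             # S720
    for field, pattern in CLASSIFIERS:
        if pattern.fullmatch(item):
            return field
    return DEFAULT_FIELD


def build_record(items):                        # S730
    return {classify(item): item for item in items}


intermediate = select_intermediate(PAGE, SELECTOR)
record = build_record(intermediate)
```

Only the three list items reach the classifier; the navigation text is never examined, which reflects the load reduction described above.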

FIG. 8 is a flowchart of a process in which a computing device selects a data generation rule and extracts structured data according to another embodiment.

According to another embodiment, the computing device may construct a rule list based on the data extraction parameters stored in the parameter database and output the constructed rule list (S810). The rule list may be a set of data generation rules constructed from the data extraction parameters. For example, as shown in FIG. 9, the computing device may output, through a display device, a rule list (900) that gathers the data generation rules. Referring to FIG. 9, each item (910) of the rule list (900) may represent one data set in which information about a document providing source (911) is matched with a data generation rule. The data generation rule may include a data extraction rule (912) and a data storage format (913).

Referring again to FIG. 8, the computing device may select a data generation rule (S820). According to an embodiment, the computing device may receive an input that selects a data generation rule from the rule list. That is, the user may select, from the rule list, the data generation rule to be used for extracting data and composing the structured data. According to another embodiment, the computing device may select the data generation rule based on information about the source from which the document was obtained.

Next, the computing device may extract structured data from the document according to the selected data generation rule (S830).
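The rule list and the two selection modes of S820 can be sketched as follows. The entry layout mirrors the reference numerals 911 to 913 only loosely; the names, the text rendering, and the prefix-based source matching are assumptions for illustration, since FIG. 9 is not reproduced in the text.

```python
from typing import List, NamedTuple, Optional


class RuleEntry(NamedTuple):
    source: str            # information about the document providing source (911)
    extraction_rule: str   # data extraction rule (912), here a location path
    storage_format: str    # data storage format (913), here a target field name


def build_rule_list(rows) -> List[RuleEntry]:              # S810: construct the list
    return [RuleEntry(*row) for row in rows]


def render_rule_list(rules: List[RuleEntry]) -> str:       # S810: output the list
    return "\n".join(
        f"[{i}] {r.source} | {r.extraction_rule} | {r.storage_format}"
        for i, r in enumerate(rules)
    )


def select_rule(rules: List[RuleEntry],
                index: Optional[int] = None,
                source_url: Optional[str] = None) -> RuleEntry:   # S820
    if index is not None:
        return rules[index]            # selection by user input
    for rule in rules:
        if source_url is not None and source_url.startswith(rule.source):
            return rule                # selection from source information
    raise LookupError("no data generation rule matches the request")


# Hypothetical parameter rows read from the parameter database.
rows = [
    ("https://source-a.example/", ".//h1[@class='id']", "vuln_id"),
    ("https://source-b.example/", ".//td[@id='cve']", "vuln_id"),
]
rules = build_rule_list(rows)
listing = render_rule_list(rules)
chosen_by_user = select_rule(rules, index=1)
chosen_by_source = select_rule(rules, source_url="https://source-a.example/page/7")
```

The selected entry would then drive the extraction in S830 in the same way as a rule constructed directly from the source URL.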

The methods according to the embodiments of the present invention described so far may be performed by executing a computer program implemented as computer-readable code. The computer program may be transmitted from a first computing device to a second computing device over a network such as the Internet and installed on the second computing device, where it can then be used. The first computing device and the second computing device include server devices, physical servers belonging to a server pool for a cloud service, and stationary computing devices such as desktop PCs.

The computer program may also be stored on a recording medium such as a DVD-ROM or a flash memory device.

Although embodiments of the present invention have been described above with reference to the accompanying drawings, those of ordinary skill in the art to which the present invention pertains will understand that the present invention may be implemented in other specific forms without changing its technical spirit or essential features. Therefore, the embodiments described above should be understood as illustrative in all respects and not restrictive.

Claims

A method for a computing device to generate structured data from a document, the method comprising:
obtaining, from a plurality of document providing sources, a plurality of documents formed in different structures according to different data representation methods;
obtaining, from a parameter database, data extraction parameters respectively corresponding to the plurality of document providing sources;
constructing, based on the data extraction parameters respectively corresponding to the plurality of document providing sources, data generation rules each defining a rule for extracting data from the document and a format for storing the extracted data; and
extracting, from the plurality of documents according to the respective constructed data generation rules, structured data having the same structure according to the same data representation method,
wherein constructing the data generation rules comprises changing the data generation rule by modifying the data extraction parameter, using a parameter change program, to correspond to a format of the plurality of documents when the plurality of documents are formed in different structures, and
wherein extracting the structured data comprises constructing a parser suited to each of the plurality of documents formed in different structures, using the data generation rule changed by the modified data extraction parameter.

The method of claim 1,
wherein the rule for extracting the data includes a selector indicating the location of the data within the document.

The method of claim 2,
wherein the extracting comprises:
selecting, from the document based on the selector, intermediate data to be included in the structured data; and
classifying the intermediate data according to the data generation rule and generating the structured data based on the classification of the intermediate data.

The method of claim 3,
wherein generating the structured data comprises classifying the intermediate data using at least one of a regular expression, a machine-learning-based classification model, and a text extraction algorithm.

The method of claim 1,
wherein obtaining the documents comprises obtaining a web document based on a Uniform Resource Locator (URL) of the document providing source,
wherein the parameter database stores the extraction parameter matched with the URL, and
wherein obtaining the extraction parameter comprises retrieving the extraction parameter based on the URL.

The method of claim 1, further comprising:
outputting a rule list including the data generation rule; and
receiving an input selecting the data generation rule from the list,
wherein extracting the data comprises extracting the structured data based on the selected data generation rule.

The method of claim 1,
wherein obtaining the documents comprises crawling a web document from a preset URL.

The method of claim 1,
wherein obtaining the documents comprises obtaining a web document providing vulnerability information, and
wherein the vulnerability information includes a type of the vulnerability, a product name associated with the vulnerability, an identifier of the vulnerability, and a date on which the vulnerability occurred.

A computing device comprising:
a crawler configured to obtain, from a plurality of external document providing sources, a plurality of documents formed in different structures according to different data representation methods;
a parameter database configured to store, for the plurality of documents, data extraction parameters, each being a setting value for a data generation rule that defines a rule for extracting data from the document and a format for storing the extracted data; and
a processor configured to select the data extraction parameters respectively based on the plurality of document providing sources, construct the data generation rules respectively based on the selected data extraction parameters, and extract, from the plurality of documents according to the respective constructed data generation rules, structured data having the same structure according to the same data representation method,
wherein the processor, when the plurality of documents are formed in different structures, changes the data generation rule by modifying the data extraction parameter, using a parameter change program, to correspond to a format of the plurality of documents, and constructs a parser suited to each of the plurality of documents formed in different structures using the data generation rule changed by the modified data extraction parameter.

A computer program recorded on a non-transitory computer-readable medium, wherein instructions of the computer program, when executed by a processor of a server, perform operations comprising:
obtaining, from a plurality of document providing sources, a plurality of documents formed in different structures according to different data representation methods;
obtaining, from a parameter database, data extraction parameters respectively corresponding to the plurality of document providing sources;
constructing, for the plurality of documents and based on the respective obtained data extraction parameters, data generation rules each defining a rule for extracting data from the document and a format for storing the extracted data; and
extracting, from the plurality of documents according to the respective constructed data generation rules, structured data having the same structure according to the same data representation method,
wherein constructing the data generation rules comprises changing the data generation rule by modifying the data extraction parameter, using a parameter change program, to correspond to a format of the plurality of documents when the plurality of documents are formed in different structures, and
wherein extracting the structured data comprises constructing a parser suited to each of the plurality of documents formed in different structures, using the data generation rule changed by the modified data extraction parameter.