KR20190058141A

KR20190058141A - Method for generating data extracted from document and apparatus thereof

Info

Publication number: KR20190058141A
Application number: KR1020170155836A
Authority: KR
Inventors: 김기수; 이종수; 김환국; 김태은; 장대일; 유창훈; 손영남; 고은혜; 나사랑
Original assignee: 주식회사 루테스
Priority date: 2017-11-21
Filing date: 2017-11-21
Publication date: 2019-05-29
Also published as: KR102033416B1

Abstract

Provided are a method and an apparatus for extracting required data from externally collected documents and generating structured data. According to one embodiment of the present invention, a data generation method comprises: obtaining a document from a document providing source; obtaining, from a parameter database, a data extraction parameter corresponding to the document providing source; constructing, based on the data extraction parameter, a data generation rule that defines a rule for extracting data from the document and a format for storing the extracted data; and extracting structured data from the document in accordance with the data generation rule.

Description

METHOD FOR GENERATING DATA EXTRACTED FROM DOCUMENT AND APPARATUS THEREOF

The present invention relates to a method and an apparatus for generating structured data from a document. More specifically, it relates to a method for extracting required data from externally collected documents and generating structured data.

With advances in communication and computing technology, a vast number of documents are available over the Internet. Techniques such as crawling and scraping are used to obtain required data from documents provided over a network: they acquire web documents and extract data from them. For example, the search engine of a portal service can provide a search service by collecting and indexing information on the web. The component that collects web information and extracts data is referred to as a spider, a bot, or an intelligent agent.

There is a need to extract desired data from the many documents collected over the Internet and to store it in a database. For example, security vulnerability information about software products and network devices, published on various websites, needs to be collected and organized into a database. To this end, a computing device collects documents by crawling, and an analysis device equipped with a document analysis module (for example, a parser) extracts and analyzes the required data. The analysis module extracts and analyzes data according to how the document presents it. For example, the module may be configured to extract the data located after a hyphen (-).
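As a minimal sketch of the format-dependent extraction described above, the following Python function returns whatever follows the first hyphen on a line. The function name and the sample strings are illustrative assumptions, not taken from the patent; the sketch only shows that such a parser stops working as soon as the source changes its separator.

```python
def extract_after_hyphen(line: str) -> str:
    """Return the text that follows the first hyphen in the line."""
    _, separator, value = line.partition("-")
    if not separator:
        # The line does not use the presentation this parser was written for.
        raise ValueError("no hyphen found in line")
    return value.strip()


# Hypothetical sample lines (not from the patent).
product = extract_after_hyphen("Product - ExampleServer")

try:
    extract_after_hyphen("Product: ExampleServer")  # source switched to a colon
    format_change_detected = False
except ValueError:
    format_change_detected = True
```

The second call fails because the parsing logic is tied to one presentation of the data.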

However, when data is extracted with an analysis module that depends on how the document presents its data, the module must be redeveloped every time the presentation changes, whether because documents come from a different source or because a source changes its document format. In addition, the computing device that collects and analyzes documents must be stopped in order to apply the redeveloped module.

The technical problem addressed by the present invention is to provide a method and an apparatus that minimize redevelopment of the module that analyzes collected documents, even when the form or presentation structure of the collected documents changes.

The technical problems of the present invention are not limited to those mentioned above, and other technical problems not mentioned will be clearly understood by those skilled in the art from the following description.

To solve the above technical problem, a data generation method according to one embodiment of the present invention may comprise: obtaining a document from a document providing source; obtaining, from a parameter database, a data extraction parameter corresponding to the document providing source; constructing, based on the data extraction parameter, a data generation rule that defines a rule for extracting data from the document and a format for storing the extracted data; and extracting structured data from the document in accordance with the data generation rule.

According to another embodiment, the rule for extracting data may include a selector indicating the location of data within the document.

According to yet another embodiment, the extracting step may comprise selecting intermediate data from the document based on the selector, classifying the intermediate data, and generating the structured data based on the classification of the intermediate data.

According to yet another embodiment, the step of generating the structured data may classify the intermediate data using at least one of a regular expression, a machine-learning-based classification model, and a text extraction algorithm.

According to yet another embodiment, the step of obtaining the document may obtain a web document based on a URL (Uniform Resource Locator) of the document providing source, the parameter database may store the extraction parameter matched with the URL, and the step of obtaining the extraction parameter may comprise searching for the extraction parameter based on the URL.

According to yet another embodiment, the data generation method may further comprise outputting a rule list that includes the data generation rule and receiving an input that selects the data generation rule from the list, and the extracting step may extract the structured data based on the selected data generation rule.

According to yet another embodiment, the step of obtaining the document may crawl a web document from a preset URL.

According to yet another embodiment, the step of obtaining the document may obtain a web document that provides vulnerability information, and the vulnerability information may include the vulnerability type, the name of the affected product, the vulnerability identifier, and the date the vulnerability occurred.

To solve the above technical problem, a computing device according to one embodiment of the present invention may comprise: a crawler that obtains a document from an external document providing source; a parameter database that stores a data extraction parameter, which is a setting value for a data generation rule defining a rule for extracting data from the document and a format for storing the extracted data; and a processor configured to select the data extraction parameter based on the document providing source, construct the data generation rule based on the data extraction parameter, and extract structured data from the document in accordance with the data generation rule.

To solve the above technical problem, a computer program recorded on a non-transitory computer-readable medium according to one embodiment of the present invention, when its instructions are executed by a processor of a server, causes operations to be performed comprising: obtaining a document from a document providing source; obtaining, from a parameter database, a data extraction parameter corresponding to the document providing source; constructing, based on the data extraction parameter, a data generation rule that defines a rule for extracting data from the document and a format for storing the extracted data; and extracting structured data from the document in accordance with the data generation rule.

FIGS. 1 and 2 show examples of documents provided by different document providing sources.
FIG. 3 illustrates a structure for generating structured data from documents that present data in different ways.
FIG. 4 shows an example of source code for web page analysis.
FIG. 5 illustrates a structure in which a computing device according to one embodiment generates structured data.
FIG. 6 is a flowchart of a process by which a computing device generates structured data from a document according to one embodiment.
FIG. 7 is a flowchart of a process by which a computing device extracts structured data according to one embodiment.
FIG. 8 is a flowchart of a process by which a computing device selects a data generation rule and extracts structured data according to another embodiment.
FIG. 9 is a conceptual diagram showing an example of a rule list and data generation rules according to one embodiment.

Hereinafter, preferred embodiments of the present invention are described in detail with reference to the accompanying drawings. The advantages and features of the present invention, and the methods of achieving them, will become apparent from the embodiments described in detail below together with the accompanying drawings. However, the present invention is not limited to the embodiments disclosed below and may be implemented in various different forms. The embodiments are provided only so that the disclosure of the present invention is complete and so that those of ordinary skill in the art to which the invention pertains are fully informed of its scope; the invention is defined only by the scope of the claims. Like reference numerals refer to like elements throughout the specification.

Unless otherwise defined, all terms used herein (including technical and scientific terms) have the meaning commonly understood by those of ordinary skill in the art to which the invention pertains. Terms defined in commonly used dictionaries are not to be interpreted ideally or excessively unless expressly and specifically defined. The terminology used herein is for describing the embodiments and is not intended to limit the invention. In this specification, the singular form includes the plural unless the context specifically states otherwise.

Hereinafter, several embodiments of the present invention are described with reference to the drawings.

FIGS. 1 and 2 show examples of documents provided by different document providing sources.

FIG. 1 is an example of a web page containing CVE (Common Vulnerabilities and Exposures) information, the vulnerability information provided by the NVD (National Vulnerability Database). The CVE information includes all or some of a CVE-ID (110), vulnerability overview information (Overview) (120), CVSS (130), CPE (140), CWE (150), and Reference (160). The vulnerability overview information (120) may consist of items such as "place where a vulnerability was discovered", "(in) related software product names", "(when)conditions of the vulnerability occurrence", "(allow)attacker type", "(to)results of attack", "(via)means of attack", "(aka)vulnerability title in the reference site", and "(a different vulnerability than)other CVE-IDs".

FIG. 2 is an example of vulnerability information provided through another web page (200). In the example of FIG. 2, the vulnerability information (210) includes a vulnerability identifier (Bugtraq ID; B-ID), the vulnerability class (class), a CVE-ID (CVE), remote/local information (Remote, Local), the publication date, and the names of vulnerable products (Vulnerable). The web page (200) may further include, as other vulnerability information, a title (260), discussion (220), exploit information (exploit) (230), solution (240), and references (250).

Comparing FIGS. 1 and 2, although both are web pages that provide vulnerability information, the composition of the information and the positions where it is displayed differ. Therefore, if a parser that extracts the information displayed after "CVE-ID" on the web page of FIG. 1 as the vulnerability identifier is applied to the web page (200) of FIG. 2, the parser fails to extract the vulnerability identifier.

Accordingly, structured data can be generated through a structure such as that shown in FIG. 3. FIG. 3 illustrates a structure for generating structured data from documents that present data in different ways.

The crawler (310) may collect a first document (21), a second document (22), and a third document (23) from a first document providing source (11), a second document providing source (12), and a third document providing source (13), respectively.

The computing device may generate first structured data (31) by loading the obtained first document (21) into a first parser (321). Likewise, it may generate second structured data (32) by loading the obtained second document (22) into a second parser (322), and third structured data (33) by loading the third document (23) into a third parser (323). In this case, a separate parser must be developed for the data presentation of each document, and whenever the data presentation of a document changes, the corresponding parser must be redeveloped.

FIG. 4 shows an example of source code (400) for web page analysis. The source code (400) of FIG. 4 is written so that the computing device extracts, through doc.select(), the text at a fixed position in the document. If the position of the element from which the value is extracted changes, the source code (400) must be modified. The modified source code must then be recompiled and applied to the computing device before the device can extract the desired data from the document whose element position changed. Moreover, the more items there are to extract and the more complex the structure, the harder the source code (400) becomes to modify.

FIG. 5 illustrates a structure in which a computing device (500) according to one embodiment generates structured data. FIG. 5 merely illustrates one embodiment; the structure of the computing device (500) and the number of its components may vary by embodiment.

A computing device (500) according to one embodiment may include a crawler (510), a processor (520), and a parameter database (530). The crawler (510) may obtain documents (21, 22, 23) from the document providing sources (11, 12, 13), respectively. According to one embodiment, the crawler (510) may obtain the documents (21, 22, 23) by accessing preset URLs (Uniform Resource Locators).

The parameter database (530) may store data extraction parameters. A data extraction parameter is a setting value for a data generation rule, which defines a rule for extracting data from a document and a format for storing the extracted data. The data extraction parameter may be stored matched with information about the document providing source (for example, a URL). That is, the rule for extracting data, the format for storing the extracted data, and the information about the document providing source may be stored in the parameter database (530) as one data set.

The processor (520) may select a data extraction parameter based on the document providing source. For example, to extract data from the first document (21), the processor (520) may look up the URL of the first document providing source (11) in the parameter database (530) and select the data extraction parameter that forms one data set with that URL.
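One way to picture the data set and the URL lookup described above is the following Python sketch. The class names, field names, URL, and selector string are all hypothetical; the patent does not specify a schema, so this is only an assumption about how source information, extraction rule, and storage format could be kept together.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class ExtractionParameter:
    """One data set: source information, extraction rule, and storage format."""
    source_url: str                  # information about the document providing source
    extraction_rule: Dict[str, str]  # output item -> selector (hypothetical strings)
    storage_format: Dict[str, str]   # output item -> value type, e.g. "string"


class ParameterDatabase:
    """In-memory stand-in for the parameter database (530)."""

    def __init__(self) -> None:
        self._by_url: Dict[str, ExtractionParameter] = {}

    def store(self, parameter: ExtractionParameter) -> None:
        self._by_url[parameter.source_url] = parameter

    def lookup(self, source_url: str) -> Optional[ExtractionParameter]:
        # Returns None when no parameter is registered for the source.
        return self._by_url.get(source_url)


db = ParameterDatabase()
db.store(ExtractionParameter(
    source_url="https://source-one.example/vuln",
    extraction_rule={"vulnerability_id": "td.cve-id"},
    storage_format={"vulnerability_id": "string"},
))
selected = db.lookup("https://source-one.example/vuln")
missing = db.lookup("https://unknown.example/")
```

Keying the store on the source URL mirrors the matching of parameters to sources; a real deployment would presumably use a persistent database rather than a dictionary.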

The processor (520) may construct a data generation rule based on the selected data extraction parameter. For example, based on the data extraction parameter, the processor (520) may construct a data generation rule that defines which position in the first document (21) data is extracted from and which item of the first structured data (31) the extracted data is stored in. The processor (520) may then extract data from the first document (21) and generate the first structured data (31) according to the constructed data generation rule.

As shown in FIG. 5, parameters are stored separately in the parameter database, and the processor (520) generates structured data not with a parser built individually for each document but by using the data generation rule to configure, by itself, a parser suited to each document. Furthermore, even when the data presentation of a document provided by a document providing source changes, there is no need to stop the computing device (500) and recompile; by modifying only the data extraction parameter stored in the parameter database with a parameter modification program, the computing device (500) can extract data according to the changed presentation.
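The effect of keeping the rules as data can be sketched as follows. This is an illustration under stated assumptions: regular expressions stand in for the stored extraction rule (the patent mentions selectors), and the URL, field name, and sample document are hypothetical. The extraction function reads its rules from the store on every call, so editing the stored parameter changes behavior without touching the code.

```python
import re
from typing import Dict, Optional

SOURCE_URL = "https://source-one.example/vuln"  # hypothetical source

# Stand-in for the parameter database: source URL -> {output item: pattern}.
parameter_db: Dict[str, Dict[str, str]] = {
    SOURCE_URL: {"vulnerability_id": r"CVE-ID:\s*(\S+)"},
}


def extract(source_url: str, document: str) -> Dict[str, Optional[str]]:
    """Build the extraction rules from the stored parameters on every call."""
    record: Dict[str, Optional[str]] = {}
    for item, pattern in parameter_db[source_url].items():
        match = re.search(pattern, document)
        record[item] = match.group(1) if match else None
    return record


# The source changes how it presents the identifier.
changed_document = "Identifier = CVE-2017-0001"
before = extract(SOURCE_URL, changed_document)

# Only the stored parameter is edited; extract() itself is untouched.
parameter_db[SOURCE_URL]["vulnerability_id"] = r"Identifier\s*=\s*(\S+)"
after = extract(SOURCE_URL, changed_document)
```

Before the parameter update the identifier is not found; after it, the same running function extracts it from the new layout.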

FIG. 6 is a flowchart of a process by which a computing device generates structured data from a document according to one embodiment.

First, the computing device may obtain a document from a document providing source (S610). The document providing source may be a device external to the computing device. For example, the computing device may obtain the document by crawling a preset URL or by downloading a web document file.

Next, the computing device may obtain, from the parameter database, the data extraction parameter corresponding to the document providing source from which the document was obtained (S620). The parameter database may store each data extraction parameter matched with the URL of a document providing source. The computing device can therefore use the URL of the source from which the document was obtained to look up and obtain the matching data extraction parameter from the parameter database.

Next, the computing device may use the obtained parameter to construct a data generation rule that defines the settings for extracting data and storing the extracted data in a predetermined format (S630). The computing device may then extract structured data by extracting data according to the constructed data generation rule and storing the extracted data in preset items (S640).
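Steps S610 to S640 can be sketched end to end in Python as shown below. Everything concrete here is an assumption made for illustration: the documents and the parameter database are in-memory dictionaries instead of a crawler and a real database, regular expressions stand in for the extraction rule, and the URL and field names are hypothetical.

```python
import re
from typing import Callable, Dict

# Hypothetical in-memory stand-ins for the external source and the parameter database.
DOCUMENTS = {
    "https://vuln.example/list": "ID: CVE-2017-0001 | Product: ExampleServer",
}
PARAMETER_DB = {
    "https://vuln.example/list": {
        "extract": {"cve": r"ID:\s*(\S+)", "name": r"Product:\s*(\w+)"},
        "format": {"cve": "vulnerability_id", "name": "product_name"},
    },
}


def obtain_document(url: str) -> str:
    """S610: obtain the document (a real system would crawl the URL)."""
    return DOCUMENTS[url]


def obtain_parameter(url: str) -> dict:
    """S620: look up the extraction parameter matched with the source URL."""
    return PARAMETER_DB[url]


def build_rule(parameter: dict) -> Callable[[str], Dict[str, str]]:
    """S630: turn the stored settings into an executable data generation rule."""
    compiled = {key: re.compile(p) for key, p in parameter["extract"].items()}
    item_names = parameter["format"]

    def rule(document: str) -> Dict[str, str]:
        record: Dict[str, str] = {}
        for key, pattern in compiled.items():
            match = pattern.search(document)
            if match:
                record[item_names[key]] = match.group(1)
        return record

    return rule


def generate_structured_data(url: str) -> Dict[str, str]:
    document = obtain_document(url)
    rule = build_rule(obtain_parameter(url))
    return rule(document)  # S640: extract into the preset items


record = generate_structured_data("https://vuln.example/list")
```

Supporting a new source in this sketch means adding one entry to the parameter store; none of the four functions changes.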

FIG. 7 is a flowchart of a process by which a computing device extracts structured data according to one embodiment.

According to one embodiment, the rule for extracting data may include a selector (for example, a CSS selector) indicating the location of the data to be selected within the document. The computing device may first select intermediate data using the selector (S710).

Next, the computing device may select the data to be included in the structured data by classifying the intermediate data (S720). In step S720, the computing device may classify the intermediate data by applying to it one or more of a regular expression, a classification model generated through machine learning, and a text classification algorithm.

Next, the computing device may generate the structured data by storing the intermediate data, according to its classification result, in the items defined in the data generation rule (S730). By first selecting intermediate data with the selector and then classifying only that intermediate data, as shown in FIG. 7, the load that the computing device would incur in classifying all the data in the entire document can be reduced.
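The select-then-classify flow of S710 to S730 might look like the following sketch, which uses only the Python standard library. The `ClassSelector` class is a toy stand-in for a CSS selector such as `td.info` and is not a full selector engine; the HTML sample, the class name, and the two regular expressions are hypothetical. Only the regular-expression option for classification is shown.

```python
import re
from html.parser import HTMLParser


class ClassSelector(HTMLParser):
    """Collect the text of every element with the given tag and class."""

    def __init__(self, tag: str, css_class: str) -> None:
        super().__init__()
        self.tag = tag
        self.css_class = css_class
        self.depth = 0       # greater than 0 while inside a matching element
        self.selected = []   # one text string per matching element

    def handle_starttag(self, tag, attrs):
        if self.depth:
            if tag == self.tag:
                self.depth += 1  # nested element with the same tag name
        elif tag == self.tag and self.css_class in (dict(attrs).get("class") or "").split():
            self.depth = 1
            self.selected.append("")

    def handle_endtag(self, tag):
        if self.depth and tag == self.tag:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.selected[-1] += data


# Hypothetical classification rules: output item -> pattern the value must match.
CLASSIFIERS = [
    ("vulnerability_id", re.compile(r"^CVE-\d{4}-\d{4,}$")),
    ("published", re.compile(r"^\d{4}-\d{2}-\d{2}$")),
]


def classify(value: str):
    for item, pattern in CLASSIFIERS:
        if pattern.match(value):
            return item
    return None  # value does not belong in the structured data


html = """<table>
<tr><td class="label">ID</td><td class="info">CVE-2017-0001</td></tr>
<tr><td class="label">Date</td><td class="info">2017-11-21</td></tr>
<tr><td class="info">n/a</td></tr>
</table>"""

selector = ClassSelector("td", "info")
selector.feed(html)
intermediate = [text.strip() for text in selector.selected]  # S710

record = {}
for value in intermediate:       # S720: classify each selected value
    item = classify(value)
    if item is not None:
        record[item] = value     # S730: store under the classified item
```

The classifier only ever sees the three selected cells, not the label cells or the rest of the page, which is the load reduction the paragraph above describes.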

FIG. 8 is a flowchart of a process by which a computing device selects a data generation rule and extracts structured data according to another embodiment.

According to another embodiment, the computing device may construct a rule list based on the data extraction parameters stored in the parameter database and output the constructed rule list (S810). The rule list may be a set of data generation rules constructed from the data extraction parameters. For example, the computing device may output a rule list (900) that gathers the data generation rules, as shown in FIG. 9, through a display device. Referring to FIG. 9, each item (910) of the rule list (900) may represent one data set in which information about a document providing source (911) is matched with a data generation rule. The data generation rule may include a data extraction rule (912) and a data storage format (913).

Referring again to FIG. 8, the computing device may select a data generation rule (S820). According to one embodiment, the computing device may receive an input that selects a data generation rule from the rule list; that is, the user may select from the rule list the data generation rule used to extract data and compose the structured data. According to another embodiment, the computing device may select the data generation rule based on information about the source from which the document was obtained.
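The rule list of FIG. 9 and the two selection modes of S820 can be sketched as follows. The entry fields follow the reference numerals 911 to 913, but the concrete URLs, selector strings, and format strings are hypothetical, and the user's input is reduced to a list index for illustration.

```python
from typing import List, NamedTuple, Optional


class RuleEntry(NamedTuple):
    source: str           # information about the document providing source (911)
    extraction_rule: str  # data extraction rule (912)
    storage_format: str   # data storage format (913)


RULE_LIST: List[RuleEntry] = [
    RuleEntry("https://source-one.example/vuln", "td.cve-id", "vulnerability_id:string"),
    RuleEntry("https://source-two.example/bid", "span.cve", "vulnerability_id:string"),
]


def render_rule_list(rules: List[RuleEntry]) -> List[str]:
    """S810: produce one display line per rule."""
    return [
        f"[{i}] {rule.source} | {rule.extraction_rule} | {rule.storage_format}"
        for i, rule in enumerate(rules)
    ]


def select_rule(
    rules: List[RuleEntry],
    index: Optional[int] = None,
    source_url: Optional[str] = None,
) -> RuleEntry:
    """S820: pick a rule by user-chosen index, or by the document's source."""
    if index is not None:
        return rules[index]
    for rule in rules:
        if rule.source == source_url:
            return rule
    raise LookupError(f"no rule for source {source_url!r}")


lines = render_rule_list(RULE_LIST)
chosen_by_user = select_rule(RULE_LIST, index=1)
chosen_by_source = select_rule(RULE_LIST, source_url="https://source-one.example/vuln")
```

Selecting by index models the user's choice from the displayed list, while selecting by source URL models the automatic choice mentioned in the last sentence above.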

Next, the computing device may extract structured data from the document according to the selected data generation rule (S830).

The methods according to the embodiments of the present invention described so far may be performed by executing a computer program implemented as computer-readable code. The computer program may be transmitted from a first computing device to a second computing device over a network such as the Internet and installed on the second computing device, where it can then be used. The first and second computing devices include server devices, physical servers belonging to a server pool for a cloud service, and stationary computing devices such as desktop PCs.

The computer program may be stored on a recording medium such as a DVD-ROM or a flash memory device.

Although embodiments of the present invention have been described above with reference to the accompanying drawings, those of ordinary skill in the art to which the invention pertains will understand that the invention may be practiced in other specific forms without changing its technical idea or essential features. The embodiments described above should therefore be understood as illustrative in all respects and not restrictive.

Claims

A method by which a computing device generates structured data from a document, the method comprising:
obtaining the document from a document providing source;
obtaining, from a parameter database, a data extraction parameter corresponding to the document providing source;
configuring, based on the data extraction parameter, a data generation rule that defines a rule for extracting data from the document and a format for storing the extracted data; and
extracting structured data from the document in accordance with the data generation rule.

The method according to claim 1,
wherein the rule for extracting the data includes a selector indicating the location of data in the document.

3. The method of claim 2,
wherein the extracting comprises:
selecting intermediate data from the document based on the selector; and
classifying the intermediate data and generating the structured data based on the classification of the intermediate data.

The method of claim 3,
wherein generating the structured data comprises classifying the intermediate data using at least one of a regular expression, a machine learning-based classification model, and a text extraction algorithm.

The method according to claim 1,
wherein obtaining the document comprises obtaining a web document based on a URL (Uniform Resource Locator) of the document providing source,
wherein the parameter database stores the extraction parameter in association with the URL, and
wherein obtaining the extraction parameter comprises retrieving the extraction parameter based on the URL.

The method according to claim 1, further comprising:
outputting a rule list including the data generation rule; and
receiving an input that selects the data generation rule from the rule list,
wherein extracting the data comprises extracting the structured data based on the selected data generation rule.

The method according to claim 1,
wherein obtaining the document comprises crawling a web document from a predetermined URL.

The method according to claim 1,
wherein obtaining the document comprises obtaining a web document that provides vulnerability information, and
wherein the vulnerability information includes a type of the vulnerability, a name of the product in which the vulnerability occurs, an identifier of the vulnerability, and a date of occurrence of the vulnerability.

A computing device comprising:
a crawler that obtains a document from an external document providing source;
a parameter database that stores data extraction parameters, which are setting values for a data generation rule that defines a rule for extracting data from the document and a format for storing the extracted data; and
a processor configured to select the data extraction parameter based on the document providing source, construct the data generation rule based on the data extraction parameter, and extract structured data from the document according to the data generation rule.

A computer program recorded on a non-transitory computer-readable medium, wherein instructions of the computer program, when executed by a processor of a server, cause the server to perform:
obtaining a document from a document providing source;
obtaining, from a parameter database, a data extraction parameter corresponding to the document providing source;
configuring, based on the data extraction parameter, a data generation rule that defines a rule for extracting data from the document and a format for storing the extracted data; and
extracting structured data from the document in accordance with the data generation rule.