KR20140038834A

KR20140038834A - Apparatus and method for analyzing web page

Info

Publication number: KR20140038834A
Application number: KR1020120105428A
Authority: KR
Inventors: 신동욱; 김태환; 김정선
Original assignee: 한양대학교 에리카산학협력단
Priority date: 2012-09-21
Filing date: 2012-09-21
Publication date: 2014-03-31
Also published as: KR101409386B1

Abstract

The present invention relates to a web page analyzing apparatus and a web page analyzing method. According to the present invention, information blocks that provide the information a user requires are accurately identified in web pages that have no standardized format and cover various topics, for example web pages introducing persons. Further, the blocks of a web page are not simply divided into information blocks and non-information blocks; information blocks are classified into various types based on their content. [Reference numerals] (100) Web page analyzing apparatus; (101) Division unit; (102) Extraction unit; (103) Classification unit

Description

APPARATUS AND METHOD FOR ANALYZING WEB PAGE

The present invention relates to an apparatus and a method for analyzing web pages.

Most web pages on the web present useful information mixed with unnecessary data. Identifying only the useful information in a web page and providing it to the user, while excluding unnecessary data such as banners, advertisements, navigation bars, menus, and copyright notices, has become an important issue in fields such as web mining, information extraction, and information retrieval. It is therefore necessary to divide a web page into blocks, classify each block as an information block or a non-information block according to whether it contains main content, and identify only the information blocks in the web page. Identifying the information blocks of a web page can improve the performance of web mining, information extraction, and retrieval, and can increase the accuracy of the extracted and retrieved information.

Conventional information block identification methods target domains, such as news or e-commerce domains, that have a standardized format and consist of content on a single consistent topic, and they identify the information blocks containing the main content by considering only structural features of the web page, such as its structure or layout. For this reason, conventional methods have difficulty accurately identifying information blocks in web pages of domains that have no standardized format and provide content on various topics, such as researchers' homepages (academic homepages).

An embodiment of the present invention aims to accurately and quickly extract information useful to a user from web pages, in particular from personal homepages that have a non-standardized format and cover various topics, such as researchers' homepages.

A web page analyzing apparatus according to an embodiment of the present invention may include: a division unit that divides a web page into at least one block; an extraction unit that extracts at least one of word data, which relates to the words included in the block and the frequencies of those words, and part-of-speech data, which relates to the parts of speech of the words included in the block and the frequencies of those parts of speech; and a classification unit that classifies the block based on at least one of the word data and the part-of-speech data.

A web page analyzing method according to an embodiment of the present invention may divide a web page into blocks and classify the blocks based on feature values extracted from them. The web page analyzing method may include: dividing the web page into at least one block; extracting at least one of word data, which relates to the words included in the block and the frequencies of those words, and part-of-speech data, which relates to the parts of speech of the words included in the block and the frequencies of those parts of speech; and classifying the block based on at least one of the word data and the part-of-speech data.

A person profile creating apparatus according to an embodiment of the present invention may include: a receiving unit that receives a web page introducing a person; a web page analyzing unit that analyzes the web page so as to divide it into blocks and classify the blocks; an information providing unit that provides information on the types of the blocks constituting the web page; and a profile creating unit that creates a profile of the person using the information on the block types and the content of the blocks. The web page analyzing unit may include: a division unit that divides the web page into at least one block; an extraction unit that extracts at least one of word data, which relates to the words included in the block and the frequencies of those words, and part-of-speech data, which relates to the parts of speech of the words included in the block and the frequencies of those parts of speech; and a classification unit that classifies the block based on at least one of the word data and the part-of-speech data.

The web page analyzing method according to an embodiment of the present invention may be implemented as a computer-executable program and recorded on a computer-readable recording medium.

According to the present invention, information blocks can be correctly identified not only in standardized web pages that have a regular structure and cover a single topic, but also in non-standardized web pages that have a free format and cover various topics.

In addition, the present invention does not simply split the blocks of a web page into information blocks and non-information blocks; it can classify information blocks into predefined block types according to the content each block provides.

As described above, by accurately identifying the information blocks of a web page and classifying them into various types according to their content, the present invention makes it possible to extract and retrieve the information a user requires from the web page more accurately and quickly.

FIG. 1 is a block diagram illustrating a web page analyzing apparatus according to an embodiment of the present invention.
FIG. 2 shows a news web page, which has a standardized format and covers a single topic, divided into blocks.
FIG. 3 shows researchers' homepages, which have a non-standardized format and cover varying topics, divided into blocks.
FIG. 4 shows a web page divided into a plurality of blocks according to an embodiment of the present invention.
FIG. 5 shows an example of word data extracted from the blocks of the web page shown in FIG. 4.
FIG. 6 is a graph showing the part-of-speech distribution of the words extracted from the blocks of the web page shown in FIG. 4.
FIG. 7 is a flowchart illustrating a web page analyzing method according to an embodiment of the present invention.
FIG. 8 is a flowchart illustrating a process of extracting word data from a block according to an embodiment of the present invention.
FIG. 9 is a flowchart illustrating a process of extracting part-of-speech data from a block according to an embodiment of the present invention.
FIG. 10 is a flowchart illustrating a process of classifying blocks by type according to an embodiment of the present invention.
FIG. 11 is a block diagram illustrating a user profile creating apparatus according to an embodiment of the present invention.

Other advantages and features of the present invention, and methods of achieving them, will become apparent from the embodiments described in detail below together with the accompanying drawings. The present invention is not, however, limited to the embodiments disclosed below and may be implemented in various different forms. The embodiments are provided only so that the disclosure of the present invention is complete and so that the scope of the invention is fully conveyed to those of ordinary skill in the art to which the invention pertains; the invention is defined only by the scope of the claims.

Unless otherwise defined, all terms used herein (including technical and scientific terms) have the same meaning as commonly understood by those of ordinary skill in the art to which this invention belongs. Terms defined in general dictionaries may be interpreted as having the same meaning as in the related art and/or in the text of this application and, unless expressly defined herein, will not be interpreted in an abstract or overly formal sense.

The terminology used herein is for describing the embodiments and is not intended to limit the present invention. In this specification, the singular form includes the plural form unless specifically stated otherwise. As used herein, 'comprise' and/or its various conjugations, such as 'comprises' and 'comprising', do not exclude the presence or addition of one or more other compositions, ingredients, components, steps, operations, and/or elements besides those mentioned. The term 'and/or' herein refers to each of the listed items or to various combinations thereof.

Meanwhile, terms such as '~unit', '~er', '~block', and '~module' used throughout this specification may denote a unit that processes at least one function or operation. For example, they may denote software or a hardware component such as an FPGA or an ASIC. However, '~unit', '~er', '~block', and '~module' are not limited to software or hardware. A '~unit', '~er', '~block', or '~module' may be configured to reside in an addressable storage medium or may be configured to run on one or more processors.

Thus, as an example, a '~unit', '~er', '~block', or '~module' includes components such as software components, object-oriented software components, class components, and task components, as well as processes, functions, attributes, procedures, subroutines, segments of program code, drivers, firmware, microcode, circuitry, data, databases, data structures, tables, arrays, and variables. The functionality provided within the components and the '~unit', '~er', '~block', and '~module' elements may be combined into a smaller number of components and such elements, or further separated into additional components and such elements.

FIG. 1 is a block diagram illustrating a web page analyzing apparatus 100 according to an embodiment of the present invention. The web page analyzing apparatus 100 may divide a web page into blocks and classify the blocks by type based on feature values extracted from the content of the blocks. A feature value is a value representing a characteristic of the content a block provides, and the web page analyzing apparatus 100 may classify the blocks of the web page into predefined block types based on the feature values.

As shown in FIG. 1, the web page analyzing apparatus 100 may include a division unit 101, an extraction unit 102, and a classification unit 103. The division unit 101 may divide a web page into at least one block. The extraction unit 102 may extract at least one of word data, which relates to the words included in the block and their frequencies, and part-of-speech data, which relates to the parts of speech of the words included in the block and their frequencies. The classification unit 103 may classify the block based on at least one of the word data and the part-of-speech data.

FIG. 2 shows a web page divided into blocks according to an embodiment of the present invention. The news web page 20 shown in FIG. 2 has a standardized format and covers a single topic. The news web page 20 generally follows a format in which the company logo (CNN) is displayed at the top, a menu is placed on the left, copyright information is displayed at the bottom, and main content such as an article is provided in the center. Therefore, the information block 201 and the non-information block 202 can be distinguished relatively clearly even when only structural features, such as the size or position of the blocks in the web page 20, are used.

FIG. 3 shows researchers' homepages divided into blocks according to an embodiment of the present invention. As shown in FIGS. 3A to 3D, the researchers' homepages 31, 32, 33, and 34, unlike the news web page 20, have a non-standardized format and are typical examples of non-standard web pages that do not consistently provide information on a fixed topic. The researchers' homepages 31 and 32 shown in FIGS. 3A and 3B display the researcher's name in the top blocks 311 and 321 of the web page, whereas the researchers' homepages 33 and 34 shown in FIGS. 3C and 3D display it in portions 333 and 342 other than the top block.

Furthermore, the researcher's photograph is placed at the upper right 312 in the researcher homepage 31 of FIGS. 3A and 3C, in the second block from the top 322 in the researcher homepage 32 of FIG. 3B, and at the upper left 342 in the researcher homepage 34 of FIG. 3D. For such researchers' homepages 31, 32, 33, and 34, it is difficult to accurately identify information blocks from structural features of the web page alone, such as the size or position of the blocks.

Accordingly, the web page analyzing apparatus 100 according to an embodiment of the present invention divides a web page into blocks and then analyzes the content each block provides to extract feature values representing the characteristics of the block. The web page analyzing apparatus 100 then classifies the blocks of the web page into predefined block types based on the feature values.

The division unit 101 divides a web page into at least one block. Tags such as <Table>, <UL>, and <P> may be used to structure the content of a web page and improve its readability, and the division unit 101 may divide the web page into one or more blocks using the DOM structure based on these tags. In addition to the DOM structure, the division unit 101 may detect blocks by further considering visual cues of the web page, such as the font and color of the content and the size of the blocks.
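The tag-based part of this division can be illustrated with a minimal sketch. It is not the division unit's actual implementation: the class name `BlockSplitter`, the function `split_into_blocks`, and the choice of exactly three block-level tags are assumptions made for illustration, visual cues are ignored, and the HTML is assumed to close every block-level tag.

```python
from html.parser import HTMLParser

# Assumed set of block-level tags; the description names these three as examples.
BLOCK_TAGS = {"table", "ul", "p"}


class BlockSplitter(HTMLParser):
    """Collects the text under each outermost block-level tag as one block."""

    def __init__(self):
        super().__init__()
        self.blocks = []   # finished blocks, in document order
        self._depth = 0    # nesting depth inside block-level tags
        self._buffer = []  # text fragments of the block being built

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self._depth += 1

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS and self._depth > 0:
            self._depth -= 1
            if self._depth == 0:
                # Normalize whitespace and keep only non-empty blocks.
                text = " ".join(" ".join(self._buffer).split())
                if text:
                    self.blocks.append(text)
                self._buffer = []

    def handle_data(self, data):
        if self._depth > 0:
            self._buffer.append(data)


def split_into_blocks(html):
    """Returns the text of each outermost <table>, <ul>, or <p> element."""
    parser = BlockSplitter()
    parser.feed(html)
    parser.close()
    return parser.blocks


page = "<body><p>Jane Doe</p><ul><li>Tel: 123</li><li>Fax: 456</li></ul></body>"
blocks = split_into_blocks(page)  # one block per outermost block-level tag
```

Text outside the listed tags is dropped in this sketch; a fuller segmenter would also use layout information, as the description notes.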

FIG. 4 shows an example of a web page divided into a plurality of blocks according to an embodiment of the present invention. As shown in FIG. 4, the division unit 101 may divide the web page 40 into a first block 401, a second block 402, a third block 403, and a fourth block 404 according to visual features. According to an embodiment of the present invention, the division unit 101 may divide the web page 40 into blocks using the VIPS (Vision-based Page Segmentation) algorithm. The VIPS algorithm is a vision-based page segmentation algorithm that divides a web page into block units.

The extraction unit 102 may extract, from a block, at least one of word data, which relates to the words included in the block and the frequencies of those words, and part-of-speech data, which relates to the parts of speech of the words included in the block and the frequencies of those parts of speech. In other words, the extraction unit 102 may extract feature values representing the characteristics of a block from the content provided by the blocks 401, 402, 403, and 404, and the feature values may include either or both of the word data and the part-of-speech data.

FIG. 5 shows an example of word data extracted from the blocks 401, 402, 403, and 404 of the web page 40 shown in FIG. 4. As shown in FIG. 5, the extraction unit 102 may extract the words included in the blocks 401, 402, 403, and 404 and calculate the frequency of occurrence of each word in each block.

According to an embodiment of the present invention, the extraction unit 102 may include a term vector generation unit that detects keywords, that is, words with a high frequency of occurrence among the words included in the block, and generates a term vector whose components are the frequencies of those keywords. For example, the term vector generation unit may select a predetermined number of words in descending order of their frequency of occurrence in the block and designate those words as keywords. The term vector generation unit may then generate a term vector whose components are the frequencies of occurrence of the keywords.
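The keyword selection and term vector construction described above can be sketched as follows. The function names, the lowercase alphabetic tokenizer, and the alphabetical tie-breaking are assumptions for illustration; the description only states that a predetermined number of the most frequent words become keywords.

```python
import re
from collections import Counter


def tokenize(text):
    """Splits text into lowercase alphabetic tokens (an assumed, simplistic tokenizer)."""
    return re.findall(r"[a-z][a-z\-]*", text.lower())


def select_keywords(block_text, k):
    """Returns the k most frequent words of a block, most frequent first."""
    counts = Counter(tokenize(block_text))
    # Sort by descending frequency; break ties alphabetically for determinism.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:k]]


def term_vector(block_text, keywords):
    """Returns the frequency of each keyword in the block, in keyword order."""
    counts = Counter(tokenize(block_text))
    return [counts[word] for word in keywords]


block = "phone fax phone email phone fax"
keywords = select_keywords(block, 2)
vector = term_vector(block, keywords)
```

Keeping the keyword list separate from the vector lets the same keyword order be reused when another block is vectorized for comparison.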

According to an embodiment of the present invention, before detecting the keywords, the term vector generation unit may perform a stemming operation that removes stop words from the words included in the block. The term vector generation unit may then construct the term vector by applying TF-IDF (Term Frequency - Inverse Document Frequency) weights to the frequency of each word.
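A minimal sketch of stop-word removal followed by TF-IDF weighting is given below. The stop-word list, the function names, and the specific weighting formula (raw term frequency multiplied by log(N / df), with each block treated as one document) are assumptions; the description does not specify which TF-IDF variant is used, and the sketch performs no suffix stripping.

```python
import math
from collections import Counter

# Assumed, illustrative stop-word list; a real list would be far longer.
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "to", "is"}


def remove_stop_words(tokens):
    """Drops tokens that appear in the stop-word list."""
    return [token for token in tokens if token not in STOP_WORDS]


def tf_idf_vectors(blocks_tokens):
    """Returns one {word: weight} mapping per block.

    weight = tf * log(N / df), where tf is the word's count in the block,
    N is the number of blocks, and df is the number of blocks containing the word.
    """
    n_blocks = len(blocks_tokens)
    filtered = [remove_stop_words(tokens) for tokens in blocks_tokens]

    # Document frequency: in how many blocks each word occurs.
    df = Counter()
    for tokens in filtered:
        df.update(set(tokens))

    vectors = []
    for tokens in filtered:
        tf = Counter(tokens)
        vectors.append({w: c * math.log(n_blocks / df[w]) for w, c in tf.items()})
    return vectors


vectors = tf_idf_vectors([
    ["the", "phone", "phone", "fax"],
    ["the", "paper", "fax"],
])
```

With this variant, a word that occurs in every block receives weight zero, so only words that distinguish one block from another contribute to the vector.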

According to an embodiment of the present invention, the extraction unit 102 may include a part-of-speech distribution calculation unit that determines the part of speech of each word included in the block and calculates the ratio of the frequency of each part of speech to the number of words included in the block. The part-of-speech distribution calculation unit may determine the part of speech of a word to be one of common noun, proper noun, pronoun, preposition, conjunction, verb, numeral, adverb, adjective, postposition, determiner, and interjection, but the kinds of parts of speech are not limited to these and may be set variously depending on the embodiment. For example, in addition to the parts of speech listed above, the part-of-speech distribution calculation unit may determine the part of speech to be one of plural noun, singular verb, foreign word, symbol, and possessive ending.

FIG. 6A shows an example of part-of-speech data extracted from the blocks 401, 402, 403, and 404 of the web page 40 shown in FIG. 4. As shown in FIG. 6A, the part-of-speech distribution calculation unit may perform part-of-speech tagging on the words included in each block and calculate the part-of-speech distribution of each block for use as a feature value of the block. In FIG. 6A, NN denotes a noun, NNP a proper noun, NNS a plural noun, IN a preposition, CC a conjunction, VBP a singular verb, FW a foreign word, CD a numeral, SYM a symbol, RB an adverb, JJ an adjective, and POS a possessive ending.

The part-of-speech distribution calculation unit may then calculate the ratio of the frequency of each part of speech to the number of words included in each block. FIG. 6B shows a part-of-speech distribution graph of each block calculated from the part-of-speech data of FIG. 6A. As shown in FIG. 6B, the first block 401 has a high proportion of proper nouns, the third block 403 has a high proportion of proper nouns and numerals, and the fourth block 404 has a high proportion of proper nouns and numerals and also contains a variety of other parts of speech.
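The ratio calculation can be sketched in a few lines. The sketch assumes the words have already been tagged by some external part-of-speech tagger and takes only the resulting tag sequence as input; the function name `pos_distribution` is hypothetical, and the tag labels follow the legend given for FIG. 6A.

```python
from collections import Counter


def pos_distribution(tags):
    """Returns {tag: share of the block's words carrying that tag}.

    `tags` holds one part-of-speech tag per word in the block.
    """
    total = len(tags)
    if total == 0:
        return {}  # an empty block has no distribution
    counts = Counter(tags)
    return {tag: count / total for tag, count in counts.items()}


# A block with two proper nouns, one numeral, and one noun.
distribution = pos_distribution(["NNP", "NNP", "CD", "NN"])
```

Because each share is divided by the block's own word count, blocks of different lengths can be compared on the same scale.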

The classification unit 103 may classify blocks by type based on at least one of the word data and the part-of-speech data extracted by the extraction unit 102. According to an embodiment of the present invention, the web page analyzed by the web page analyzing apparatus 100 may be a homepage introducing a person, for example a researcher. In this case, the classification unit 103 may classify a block as one of the following blocks:

(i) a basic block containing information about the person's affiliation or position;

(ii) a contact block containing the person's contact information;

(iii) a photo block containing a photographic image of the person;

(iv) a text block containing information about writings authored by the person; and

(v) a mixed block containing two or more of: information about the person's affiliation or position, the person's contact information, a photographic image of the person, and information about writings authored by the person.

According to an embodiment of the present invention, if a block does not correspond to any of the five block types described above, the classification unit 103 may classify the block as a non-information block.

The block types may be defined in advance, before the web page is analyzed, by analyzing the content of a number of sample homepages. For example, a researcher's homepage may provide profile information relating to the researcher's personal details and research information relating to the researcher's field of study. The attributes belonging to profile information and research information are shown in Table 1 below:

[Table 1]
Information | Attributes
Profile information | photo, position, affiliation, phone, fax, email, address, bsuniv, bsmajor, bsdate, msuniv, msmajor, msdate, phduniv, phdmajor, phddate
Research information | research area/interesting, publication list, research interest(implicit)

Here, bsuniv denotes the undergraduate university, bsmajor the undergraduate major, bsdate the date the bachelor's degree was obtained, msuniv the university that granted the master's degree, msmajor the master's degree major, msdate the date the master's degree was obtained, phduniv the university that granted the doctoral degree, phdmajor the doctoral degree major, and phddate the date the doctoral degree was obtained. In addition, research area/interesting denotes the research field or area of interest explicitly stated on the researcher's homepage, and research interest(implicit) denotes the research field or area of interest that is implicitly revealed on the researcher's homepage.

The information provided by a particular domain can be divided into essential information and optional information. Essential information is information that web pages in the domain must include, and optional information is information that is included selectively as needed. According to an embodiment of the present invention, the frequency of occurrence of each attribute of the profile information and research information may be analyzed over an arbitrary number of homepages, and attributes found in at least a predetermined proportion of them (for example, 70% or more) may be determined to be essential information for researchers' homepages. The block types corresponding to the six categories described above may then be defined in consideration of the relationships among the attributes.
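The selection of essential attributes by occurrence rate can be sketched as follows. The function name `essential_attributes` and the representation of each sample homepage as a set of attribute names are assumptions; only the 70% example threshold comes from the description.

```python
def essential_attributes(pages, threshold=0.7):
    """Returns the attributes found in at least `threshold` of the sample pages.

    `pages` is a list of sets, each holding the attribute names found on one homepage.
    """
    if not pages:
        return set()

    # Count on how many pages each attribute occurs.
    counts = {}
    for attributes in pages:
        for name in attributes:
            counts[name] = counts.get(name, 0) + 1

    return {name for name, count in counts.items() if count / len(pages) >= threshold}


samples = [
    {"email", "photo"},
    {"email", "phone"},
    {"email", "photo"},
    {"email"},
]
essential = essential_attributes(samples)  # only "email" occurs on 70% or more of pages
```

Attributes below the threshold would fall into the optional category in this reading.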

According to an embodiment of the present invention, the classification unit 103 may compare the word data or part-of-speech data of a block with the reference word data or reference part-of-speech data set for each of the basic block, contact block, photo block, text block, and mixed block. Then, if the word data matches reference word data or the part-of-speech data matches reference part-of-speech data, the classification unit 103 may classify the block as the block type corresponding to the matched reference word data or reference part-of-speech data.
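One way to read this comparison is sketched below. The description does not define what counts as a match, so the use of cosine similarity, the 0.5 minimum similarity, the function names, and the reference vectors are all assumptions chosen for illustration. A block that matches no reference falls back to the non-information type.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def classify_block(vector, references, min_similarity=0.5):
    """Returns the block type whose reference vector is most similar to `vector`.

    `references` maps a block type name to its reference vector.
    Blocks below `min_similarity` for every type are labeled "non-information".
    """
    best_type, best_score = "non-information", min_similarity
    for block_type, reference in references.items():
        score = cosine_similarity(vector, reference)
        if score >= best_score:
            best_type, best_score = block_type, score
    return best_type


# Hypothetical reference term vectors over the keywords (phone, fax, paper).
references = {"contact": [3, 2, 0], "text": [0, 0, 4]}
label = classify_block([2, 1, 0], references)
```

The same function works for part-of-speech data if each distribution is laid out as a vector with a fixed tag order.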

The reference word data or reference part-of-speech data set for each of the basic block, the contact block, the photo block, the text block, and the mixed block may be stored in a storage unit such as a memory and read out when the classification process is executed. The reference word data and the reference part-of-speech data are data representing the characteristics of the block types described above. For example, because a contact block contains contact details such as a person's telephone number, fax number, e-mail address, and work address, it is likely to contain words such as "Tel", "Phone", "Fax", "Mobile", and "E-mail" as keywords. Therefore, for the contact block, a word vector whose components are the expected frequencies of these keywords in a block may be set as the reference word data. In addition, because a contact block contains the person's various telephone numbers, its proportion of numerals may be relatively high compared with other blocks. Therefore, reference part-of-speech data with an occurrence-ratio threshold of 30% for numerals may be set for the contact block, so that a block is classified as a contact block when numerals account for 30% or more of the words in the block.
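
As a non-limiting illustration, the two contact-block features described above (a keyword frequency vector and the proportion of numerals) could be computed as sketched below. The tokenization rule, the rule that a token made only of digits and hyphens counts as a numeral, and all function names are assumptions made for this sketch; the keyword list and the 30% threshold follow the example in the text.

```python
import re

CONTACT_KEYWORDS = ["tel", "phone", "fax", "mobile", "e-mail"]
NUMERAL_RATIO_THRESHOLD = 0.30  # threshold from the example in the text

def tokenize(text):
    # Assumption: a token is a run of letters/digits, optionally joined by hyphens.
    return re.findall(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*", text)

def keyword_vector(text):
    """Frequency of each contact keyword in the block (case-insensitive)."""
    tokens = [t.lower() for t in tokenize(text)]
    return [tokens.count(k) for k in CONTACT_KEYWORDS]

def numeral_ratio(text):
    """Share of tokens that consist only of digits and hyphens."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    numerals = [t for t in tokens if t.replace("-", "").isdigit()]
    return len(numerals) / len(tokens)

def meets_contact_numeral_threshold(text):
    return numeral_ratio(text) >= NUMERAL_RATIO_THRESHOLD

sample_block = "Tel: 031-400-1234, Fax: 031-400-5678, Mobile: 010-1234-5678"
```

For the hypothetical `sample_block`, three of the six tokens are numerals, so the block clears the 30% threshold.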

According to an embodiment of the present invention, the reference word data or reference part-of-speech data may be built using a Naive Bayes classifier, or using a support vector machine (SVM) to which a normalized polynomial kernel is applied.
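
The text does not give a formula for the normalized polynomial kernel. As a non-limiting illustration, one commonly used definition divides a polynomial kernel by the geometric mean of its self-similarities, K(x, y) / sqrt(K(x, x) K(y, y)). The sketch below assumes that definition; the function names and the default degree are hypothetical.

```python
import math

def poly_kernel(x, y, degree=2):
    """Plain polynomial kernel: (x . y) ** degree."""
    return sum(a * b for a, b in zip(x, y)) ** degree

def normalized_poly_kernel(x, y, degree=2):
    """Polynomial kernel scaled so that K(x, x) == 1 for any non-zero x."""
    denom = math.sqrt(poly_kernel(x, x, degree) * poly_kernel(y, y, degree))
    return poly_kernel(x, y, degree) / denom if denom else 0.0
```

Under this definition the kernel value does not depend on the length of the feature vectors, which may be useful when blocks differ greatly in word count.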

FIG. 7 is a flowchart illustrating a web page analysis method according to an embodiment of the present invention. The web page analysis method 70 may divide a web page into blocks and classify the blocks based on feature values extracted from the blocks. Here, the feature values may include at least one of word data, which relates to the words included in a block and their frequencies, and part-of-speech data, which relates to the parts of speech of the words included in the block and their frequencies.

As shown in FIG. 7, the web page analysis method 70 according to an embodiment of the present invention may include dividing a web page into at least one block (S71), extracting at least one of word data relating to the words included in the block and their frequencies and part-of-speech data relating to the parts of speech of the words included in the block and their frequencies (S72), and classifying the block based on at least one of the word data and the part-of-speech data (S73).
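
As a non-limiting illustration, the three steps S71 to S73 could be chained as in the sketch below. The function name and the idea of passing the three steps in as callables are assumptions made for this sketch.

```python
def analyze_web_page(page, divide, extract, classify):
    """Run S71 -> S72 -> S73 and return (block, block_type) pairs."""
    results = []
    for block in divide(page):                        # S71: divide into blocks
        features = extract(block)                     # S72: extract feature values
        results.append((block, classify(features)))  # S73: classify the block
    return results
```

Keeping the steps as separate callables mirrors the division, extraction, and classification units of the apparatus, so that each step can be replaced independently.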

According to an embodiment of the present invention, the step S71 of dividing the web page into blocks may include dividing the web page into a plurality of blocks by reflecting visual features such as the font, the text color, and the block size together with the DOM structure of the web page.
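
As a non-limiting illustration, a much simplified division step that uses only the markup structure could look like the sketch below. It treats each outermost block-level element as one block and ignores the visual features (font, text color, block size) mentioned above, since those require rendering the page. The tag list, the class name, and the function name are assumptions, and the sketch assumes markup whose block-level tags are properly closed.

```python
from html.parser import HTMLParser

# Assumption: these tags are treated as block boundaries.
BLOCK_TAGS = {"div", "p", "table", "ul", "ol", "section", "address"}

class BlockSplitter(HTMLParser):
    """Collects the text of each outermost block-level element."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # nesting depth inside block-level tags
        self.blocks = []  # text of the completed blocks
        self._buf = []    # text fragments of the current block

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            if self.depth == 0:
                self._buf = []
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                # Collapse whitespace and keep only non-empty blocks.
                text = " ".join(" ".join(self._buf).split())
                if text:
                    self.blocks.append(text)

    def handle_data(self, data):
        if self.depth > 0:
            self._buf.append(data)

def divide_into_blocks(html):
    splitter = BlockSplitter()
    splitter.feed(html)
    splitter.close()
    return splitter.blocks
```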

FIG. 8 is a flowchart illustrating a process of extracting word data from a block according to an embodiment of the present invention. As shown in FIG. 8, the step S72 of extracting word data from the block may include extracting the words included in the block (S721), removing stop words from the words included in the block (S722), detecting keywords having a high occurrence frequency among the words included in the block (S723), and generating a word vector whose components are the frequencies of the keywords (S724).

For example, the extracting step S72 may treat words that are not used as search terms in information retrieval (for example, articles, prepositions, particles, and conjunctions) as stop words and remove them from the words included in the block, and may then select a preset number of the remaining words (for example, the top four words) as keywords in descending order of occurrence frequency. A word vector may then be generated using the frequencies of the words selected as keywords as its components.
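
As a non-limiting illustration, steps S721 to S724 could be sketched as follows. The stop-word list, the tokenization rule, and the function name are assumptions; the default of four keywords follows the example in the text.

```python
import re
from collections import Counter

# Hypothetical, deliberately short stop-word list for illustration.
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "at", "and", "or", "to", "for"}

def word_vector(text, top_n=4):
    """Return (keyword, frequency) pairs for the `top_n` most frequent words."""
    words = [w.lower() for w in re.findall(r"[A-Za-z]+", text)]  # S721
    words = [w for w in words if w not in STOP_WORDS]            # S722
    return Counter(words).most_common(top_n)                     # S723, S724

sample_text = "The mining of data and the mining of web data for mining"
```

For the hypothetical `sample_text`, the stop words are dropped and "mining" (3 occurrences) and "data" (2 occurrences) rank ahead of "web" (1 occurrence).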

FIG. 9 is a flowchart illustrating a process of extracting part-of-speech data from a block according to an embodiment of the present invention. As shown in FIG. 9, the step S72 of extracting the part-of-speech data may include extracting the words included in the block (S721), determining the parts of speech of the words included in the block (S722), and calculating the ratio of the frequency of each part of speech to the number of words included in the block (S723). The part of speech of a word included in the block may be any one of common noun, proper noun, pronoun, preposition, conjunction, verb, numeral, adverb, adjective, particle, determiner, and interjection, but the kinds of parts of speech are not limited thereto and may be set variously depending on the embodiment.
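
As a non-limiting illustration, the part-of-speech ratio of step S723 could be computed as sketched below. The toy lexicon and the `tag` helper are hypothetical stand-ins for step S722; an actual embodiment would use a proper part-of-speech tagger.

```python
from collections import Counter

# Hypothetical toy lexicon standing in for a real part-of-speech tagger.
TOY_LEXICON = {"tel": "noun", "fax": "noun", "call": "verb", "me": "pronoun"}

def tag(word):
    # Assumption: a token made only of digits and hyphens is a numeral.
    if word.replace("-", "").isdigit():
        return "numeral"
    return TOY_LEXICON.get(word.lower(), "unknown")

def pos_distribution(words):
    """Ratio of each part of speech to the number of words in the block."""
    if not words:
        return {}
    counts = Counter(tag(w) for w in words)                    # S722
    return {pos: n / len(words) for pos, n in counts.items()}  # S723
```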

According to an embodiment of the present invention, the web page may be a homepage introducing a person, for example, a researcher's homepage. In this case, the step of classifying the block may include classifying the block as one of the basic block, the contact block, the photo block, the text block, and the mixed block described above, and classifying the block as a non-information block if it does not correspond to any of these five block types.

FIG. 10 is a flowchart illustrating the step of classifying a block according to an embodiment of the present invention. As shown in FIG. 10, the step S73 of classifying the block may include comparing the word data or part-of-speech data of the block with the reference word data or reference part-of-speech data set for each of the basic block, the contact block, the photo block, the text block, and the mixed block (S731), and, when the word data or the part-of-speech data matches the reference word data or reference part-of-speech data of a specific block type (YES in S732), classifying the block as the block type corresponding to the matched reference word data or reference part-of-speech data (S733).

According to an embodiment of the present invention, the step S73 of classifying the block may include classifying the block as a non-information block (S734) when the word data or part-of-speech data of the block does not match any of the reference word data or reference part-of-speech data (NO in S732).
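
As a non-limiting illustration, the compare, match, and fall-back flow of steps S731 to S734 could be sketched as follows. The text does not fix how a match is decided; this sketch assumes cosine similarity between word vectors with a threshold of 0.5. The reference vectors, the threshold, and the function names are all hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse word vectors (dicts)."""
    dot = sum(u[k] * v.get(k, 0) for k in u)
    norm = math.sqrt(sum(x * x for x in u.values())) * \
           math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

# Hypothetical reference word data for two of the block types.
REFERENCE_WORD_DATA = {
    "contact": {"tel": 1, "fax": 1, "e-mail": 1},
    "basic": {"professor": 1, "department": 1, "university": 1},
}

def classify_block(word_data, threshold=0.5):
    best_type, best_score = "non-information", 0.0  # S734 if nothing matches
    for block_type, reference in REFERENCE_WORD_DATA.items():  # S731
        score = cosine(word_data, reference)
        if score >= threshold and score > best_score:          # S732
            best_type, best_score = block_type, score          # S733
    return best_type
```

A block whose word data is not sufficiently similar to any reference vector keeps the initial "non-information" label, which corresponds to the NO branch of S732.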

FIG. 11 is a block diagram illustrating a person profile creating apparatus according to an embodiment of the present invention. The person profile creating apparatus 200 may extract various kinds of information about a person from the homepage of that person, create a profile of the person, and provide the profile to a user. As shown in FIG. 11, the person profile creating apparatus 200 according to an embodiment of the present invention may include a receiving unit 201, a web page analyzing unit 202, an information providing unit 203, and a profile creating unit 204.

The receiving unit 201 may receive a web page introducing a person. The web page introducing the person may be a researcher's homepage, as shown in FIGS. 3 and 4.

According to an embodiment of the present invention, the receiving unit 201 may receive code data representing the web page. For example, the receiving unit 201 may receive the HTML code or XML code of the web page. According to another embodiment of the present invention, the receiving unit 201 may receive the URL of the web page instead of its code, and may access the URL to receive the HTML code or XML code of the web page.
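
As a non-limiting illustration, a receiving step that accepts either the markup itself or a URL could be sketched as follows. The function name, the rule that an input beginning with "http://" or "https://" is treated as a URL, and the 10-second timeout are assumptions made for this sketch.

```python
from urllib.request import urlopen

def receive_web_page(source, timeout=10):
    """Return page markup, given either the markup itself or a URL."""
    candidate = source.strip()
    # Assumption: inputs beginning with an HTTP(S) scheme are URLs.
    if candidate.lower().startswith(("http://", "https://")):
        with urlopen(candidate, timeout=timeout) as response:
            charset = response.headers.get_content_charset() or "utf-8"
            return response.read().decode(charset, errors="replace")
    return source  # already HTML or XML code
```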

The web page analyzing unit 202 may analyze the web page so as to divide the web page into blocks and classify the blocks. The web page analyzing unit 202 may include the web page analyzing apparatus 100 described with reference to FIG. 1. That is, the web page analyzing unit 202 may include a division unit 2021 for dividing the web page into at least one block, an extraction unit 2022 for extracting from the block at least one of word data relating to the words included in the block and their frequencies and part-of-speech data relating to the parts of speech of the words included in the block and their frequencies, and a classification unit 2023 for classifying the block based on at least one of the word data and the part-of-speech data.

The information providing unit 203 may provide information on the types of the blocks constituting the web page. According to an embodiment of the present invention, the block types may consist of (i) a basic block containing information on the person's affiliation or position; (ii) a contact block containing information on the person's contact details; (iii) a photo block containing a photographic image of the person; (iv) a text block containing information on writings authored by the person; (v) a mixed block containing two or more of information on the person's affiliation or position, information on the person's contact details, a photographic image of the person, and information on writings authored by the person; and (vi) a non-information block containing none of the above information. However, the block types are not limited thereto and may be defined variously depending on the embodiment.

The profile creating unit 204 may create a profile of the person based on the information on the block types and the content of the blocks. The profile of the person may include at least one of the person's name, affiliation, position, contact details, and photograph, and, depending on the embodiment, may further include items such as the person's career history or a list of writings.
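
As a non-limiting illustration, a very simple profile creation step could group block contents by block type and drop non-information blocks, as sketched below. The function name, the type labels, and the dictionary layout of the profile are assumptions; extracting individual fields such as the name or position from a block's content is outside the scope of this sketch.

```python
def build_profile(classified_blocks):
    """Group block contents by block type, skipping non-information blocks."""
    profile = {}
    for block_type, content in classified_blocks:
        if block_type == "non-information":
            continue
        profile.setdefault(block_type, []).append(content)
    return profile
```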

The person profile creating apparatus 200 may further include a storage unit 205. The storage unit 205 may store various data used for web page analysis. For example, it may store the reference word data or reference part-of-speech data for the predefined block types. The classification unit 2023 may read the reference data from the storage unit 205, compare it with the word data or part-of-speech data of a block extracted by the extraction unit 2022, and classify the blocks of the web page by predefined block type.

The web page analysis method 70 according to an embodiment of the present invention described above may be implemented as a program to be executed on a computer and stored in a computer-readable recording medium. The computer-readable recording medium includes all kinds of storage devices in which data readable by a computer system is stored. Examples of the computer-readable recording medium include ROM, RAM, CD-ROM, magnetic tape, floppy disks, and optical data storage devices.

The foregoing has described a web page analyzing apparatus, a web page analysis method, and a person profile creating apparatus using the web page analyzing apparatus according to embodiments of the present invention, which divide a web page into a plurality of blocks, extract from the content of each block feature values representing the characteristics of the block, such as a word vector and the part-of-speech distribution of its words, and classify the blocks by type based on the feature values. According to the present invention, an information block providing the information required by a user can be accurately identified from a web page that is not composed in a standardized format and deals with various topics, for example, a homepage introducing a person. Furthermore, rather than simply dividing the web page into information blocks and non-information blocks, the information blocks can be subdivided into various types according to their content.

In addition, when a person's homepage is analyzed using the web page analyzing apparatus and method according to the present invention, the information used to create the person's profile can be extracted more accurately and quickly.

100: web page analyzing apparatus 101: division unit
102: extraction unit 103: classification unit
200: person profile creating apparatus 201: receiving unit
202: web page analyzing unit 203: information providing unit
204: profile creating unit 205: storage unit

Claims

1. A web page analyzing apparatus comprising:
a division unit for dividing a web page into at least one block;
an extraction unit for extracting, from the block, at least one of word data relating to the words included in the block and their frequencies, and part-of-speech data relating to the parts of speech of the words included in the block and their frequencies; and
a classification unit for classifying the block based on at least one of the word data and the part-of-speech data.

2. The web page analyzing apparatus of claim 1,
wherein the extraction unit comprises
a word vector generator for detecting keywords having a high occurrence frequency among the words included in the block, and generating a word vector having the frequencies of the keywords as its components.

3. The web page analyzing apparatus of claim 2,
wherein the word vector generator removes stop words from the words included in the block before detecting the keywords.

4. The web page analyzing apparatus of claim 1,
wherein the extraction unit comprises
a part-of-speech distribution calculating unit for determining the parts of speech of the words included in the block and calculating the ratio of the frequency of each part of speech to the number of words included in the block.

5. The web page analyzing apparatus of claim 4,
wherein the part-of-speech distribution calculating unit determines the part of speech of each word to be one of common noun, proper noun, pronoun, preposition, conjunction, verb, numeral, adverb, adjective, particle, determiner, and interjection.

6. The web page analyzing apparatus of claim 1,
wherein the web page includes a homepage introducing a person.

7. The web page analyzing apparatus of claim 6,
wherein the classification unit classifies the block as one of:
a basic block containing information on at least one of the person's affiliation and position;
a contact block containing information on the person's contact details;
a photo block containing a photographic image of the person;
a text block containing information on writings authored by the person; and
a mixed block containing at least two of information on the person's affiliation or position, information on the person's contact details, a photographic image of the person, and information on writings authored by the person.

8. The web page analyzing apparatus of claim 7,
wherein the classification unit classifies the block as a non-information block if the block does not correspond to any of the basic block, the contact block, the photo block, the text block, and the mixed block.

9. The web page analyzing apparatus of claim 7,
wherein the classification unit compares the word data or the part-of-speech data of the block with reference word data or reference part-of-speech data set for each of the basic block, the contact block, the photo block, the text block, and the mixed block, and, when the word data matches reference word data or the part-of-speech data matches reference part-of-speech data, classifies the block as the block type corresponding to the matched reference word data or reference part-of-speech data.

10. A web page analysis method comprising dividing a web page into blocks and classifying the blocks based on feature values extracted from the blocks.

11. The web page analysis method of claim 10,
wherein the web page analysis method comprises:
dividing the web page into at least one block;
extracting at least one of word data relating to the words included in the block and their frequencies, and part-of-speech data relating to the parts of speech of the words included in the block and their frequencies; and
classifying the block based on at least one of the word data and the part-of-speech data.

12. The web page analysis method of claim 11,
wherein extracting at least one of the word data and the part-of-speech data comprises:
removing stop words from the words included in the block;
detecting keywords having a high occurrence frequency among the words included in the block; and
generating a word vector having the frequencies of the keywords as its components.

13. The web page analysis method of claim 11,
wherein extracting at least one of the word data and the part-of-speech data comprises:
determining the parts of speech of the words included in the block; and
calculating the ratio of the frequency of each part of speech to the number of words included in the block.

14. The web page analysis method of claim 11,
wherein the web page includes a homepage introducing a person.

15. The web page analysis method of claim 14,
wherein classifying the block comprises:
(i) classifying the block as one of a basic block containing information on the person's affiliation or position; a contact block containing information on the person's contact details; a photo block containing a photographic image of the person; a text block containing information on writings authored by the person; and a mixed block containing at least two of information on the person's affiliation or position, information on the person's contact details, a photographic image of the person, and information on writings authored by the person; and
(ii) classifying the block as a non-information block if the block does not correspond to any of the basic block, the contact block, the photo block, the text block, and the mixed block.

16. The web page analysis method of claim 15,
wherein step (i) comprises:
comparing the word data or the part-of-speech data of the block with reference word data or reference part-of-speech data set for each of the basic block, the contact block, the photo block, the text block, and the mixed block;
detecting reference word data or reference part-of-speech data matching the word data or the part-of-speech data; and
classifying the block as the block type corresponding to the matched reference word data or reference part-of-speech data.

17. A person profile creating apparatus comprising:
a receiving unit for receiving a web page introducing a person;
a web page analyzing unit for analyzing the web page so as to divide the web page into blocks and classify the blocks;
an information providing unit for providing information on the types of the blocks constituting the web page; and
a profile creating unit for creating a profile of the person based on the information on the types of the blocks and the content of the blocks,
wherein the web page analyzing unit comprises:
a division unit for dividing the web page into at least one block;
an extraction unit for extracting, from the block, at least one of word data relating to the words included in the block and their frequencies, and part-of-speech data relating to the parts of speech of the words included in the block and their frequencies; and
a classification unit for classifying the block based on at least one of the word data and the part-of-speech data.

18. The person profile creating apparatus of claim 17,
wherein the receiving unit receives the HTML code or the XML code of the web page.

19. The person profile creating apparatus of claim 17,
wherein the receiving unit receives the URL of the web page and accesses the URL to receive the HTML code or the XML code of the web page.

20. A computer-readable recording medium storing program code for executing a web page analysis process, the process comprising:
dividing a web page into at least one block;
extracting at least one of word data relating to the words included in the block and their frequencies, and part-of-speech data relating to the parts of speech of the words included in the block and their frequencies; and
classifying the block based on at least one of the word data and the part-of-speech data.