KR101575113B1

KR101575113B1 - Method Apparatus And Computer-Readable Recording Medium with Program for Extracting Content with Web Page

Info

Publication number: KR101575113B1
Application number: KR1020080100833A
Authority: KR
Inventors: 이준호
Original assignee: 에스케이플래닛 주식회사
Priority date: 2008-10-14
Filing date: 2008-10-14
Publication date: 2015-12-09
Also published as: KR20100041584A

Abstract

The present invention relates to a method, an apparatus, and a computer-readable recording medium for extracting content in a web page.

The present invention provides an apparatus for extracting content in a web page, comprising: a first page tracking unit (Chase Page) that, when the URL page of item information extracted from a preset feed is not the collection target, constructs the page using a preset program or command and uses clue information of the item information to track only a page that contains the clue information; a content detection unit (Content Detection) that extracts, from the tracked page, a main content area according to a preset extraction rule using the clue information; a content revision unit (Content Revision) that reads a feed rule file to load the existing analysis information for the feed, compares it with the extracted main content area or the clue information, and, if anything is missing or incorrect, readjusts the main content area using the existing analysis information; and a result transmission unit (Send Result) that transmits the extracted or readjusted main content to a client.

RSS, content extraction, page tracking, revision

Description

METHOD, APPARATUS AND COMPUTER-READABLE RECORDING MEDIUM FOR EXTRACTING CONTENT IN A WEB PAGE

The present invention relates to a method, an apparatus, and a computer-readable recording medium for extracting content in a web page. More specifically, it targets the various content pages offered in RSS form and uses the summary information in RSS to extract and provide only the body page the user wants. When an ordinary RSS service offers reading of the original page for a specific item in an environment where showing the entire target page through a web browser is difficult, the invention extracts only the text- and image-centered body area, which makes that reading function easy to implement. It also greatly reduces the size of the target page and simplifies its structure. To provide RSS-based web content in this way, the invention extracts from the target web page only what the user perceives as the actual content, based on the summary information that RSS carries for that page.

In recent years, the rapid growth of the Internet has led to a great deal of information being shared. Alongside this spread, a variety of content providers have emerged that generate new profits by providing information and content on the Internet.

The information and content these providers offer are updated constantly, and users access the providers' sites and search them in order to obtain the newly updated information. Even when users want only the new updates, they end up downloading and viewing both the updates and the earlier information supplied by the content providers. RSS is used to meet this need.

RSS is a standard format for content distribution and collection. The name stands for Really Simple Syndication or Rich Site Summary, and it is an XML-based standard communication format. In other words, RSS is an XML-based data format designed to deliver update information easily to users of websites whose content changes frequently, such as news sites and blogs.

With RSS, once a user registers the address provided by a site in their own RSS feed, the user can check the site's updates through the RSS feed without visiting the site each time to look for them.

RSS is now establishing itself as an international standard, and a variety of applications built on it are expected. Accordingly, various services that apply RSS need to be developed.

To solve the problems described above, the main object of the present invention is to provide a method, an apparatus, and a computer-readable recording medium for extracting content in a web page. The invention targets the various content pages offered in RSS form and uses the summary information in RSS to extract and provide only the body page the user wants. When an ordinary RSS service offers reading of the original page for a specific item in an environment where showing the entire target page through a web browser is difficult, the invention extracts only the text- and image-centered body area, which makes that reading function easy to implement. By greatly reducing the size of the target page and simplifying its structure, it provides RSS-based web content by extracting from the target web page only what the user perceives as the actual content, based on the summary information that RSS carries for that page.

To achieve this object, the present invention provides an apparatus for extracting content in a web page, comprising: a first page tracking unit (Chase Page) that, when the URL page of item information extracted from a preset feed is not the collection target, constructs the page using a preset program or command and uses clue information of the item information to track only a page that contains the clue information; a content detection unit (Content Detection) that extracts, from the tracked page, a main content area according to a preset extraction rule using the clue information; a content revision unit (Content Revision) that reads a feed rule file to load the existing analysis information for the feed, compares it with the extracted main content area or the clue information, and, if anything is missing or incorrect, readjusts the main content area using the existing analysis information; a content storage unit (Store Content) that stores the extracted or readjusted main content as a file; and a result transmission unit (Send Result) that transmits the extracted or readjusted main content to a client.

According to another aspect of the present invention, there is provided a method for extracting content in a web page, comprising the steps of: (a) a page tracking unit constructing a page using a preset program or command when the URL page of item information extracted from a preset feed is not the collection target; (b) the page tracking unit using clue information of the item information to track, among the constructed pages, only a page that contains the clue information; (c) a content detection unit extracting, from the tracked page, a main content area according to a preset extraction rule using the clue information; (d) a content revision unit reading a feed rule file to load the existing analysis information for the feed, comparing it with the extracted main content area or the clue information, and, if anything is missing or incorrect, readjusting the main content area using the existing analysis information; (e) a content storage unit storing the extracted or readjusted main content as a file; and (f) a result transmission unit transmitting the extracted or readjusted main content to a client.

According to yet another aspect of the present invention, there is provided a computer-readable recording medium storing a program for running an apparatus for extracting content in a web page, the program implementing: a function in which a page tracking unit constructs a page using a preset program or command when the URL page of item information extracted from a preset feed is not the collection target; a function in which the page tracking unit uses clue information of the item information to track, among the constructed pages, only a page that contains the clue information; a function in which a content detection unit extracts, from the tracked page, a main content area according to a preset extraction rule using the clue information; a function in which a content revision unit reads a feed rule file to load the existing analysis information for the feed, compares it with the extracted main content area or the clue information, and, if anything is missing or incorrect, readjusts the main content area using the existing analysis information; a function in which a content storage unit stores the extracted or readjusted main content as a file; and a function in which a result transmission unit transmits the extracted or readjusted main content to a client.

As described above, the present invention makes it possible to extract from a target web page only what the user perceives as the actual content, based on the summary information that RSS carries for that page.

The present invention also makes it possible, for the various content pages offered in RSS form, to use the summary information in RSS to extract and provide only the body page the user wants.

In addition, when an ordinary RSS service offers reading of the original page for a specific item in an environment where showing the entire target page through a web browser is difficult, the present invention extracts only the text- and image-centered body area, so that reading of the related original page can be implemented easily.

Furthermore, the present invention can provide RSS-based web content by greatly reducing the size of the target page and simplifying its structure.

Hereinafter, some embodiments of the present invention are described in detail with reference to the exemplary drawings. In assigning reference numerals to the components in each drawing, the same components are given the same numerals wherever possible, even when they appear in different drawings. In describing the present invention, detailed descriptions of related well-known configurations or functions are omitted where they could obscure the gist of the invention.

In describing the components of the present invention, terms such as first, second, A, B, (a), and (b) may be used. These terms serve only to distinguish one component from another and do not limit the nature, sequence, or order of the components. When a component is described as being "connected", "coupled", or "joined" to another component, it may be directly connected or joined to that other component, but it should be understood that yet another component may also be "connected", "coupled", or "joined" between them.

FIG. 1 is a block diagram schematically showing an apparatus for extracting content in a web page according to an embodiment of the present invention.

As shown in FIG. 1, the apparatus for extracting content in a web page according to an embodiment of the present invention includes a check cache unit (Check Cache) 112, a first page tracking unit (Chase Page) 114, a content detection unit (Content Detection) 120, a content revision unit (Content Revision) 122, a content storage unit (Store Content) 130, a result transmission unit (Send Result) 132, a feed registration unit (Feed Registration) 140, an item extraction unit (Extract Item) 142, a second page tracking unit (Chase Page) 144, a pattern analysis unit (Pattern Analysis) 146, and a pattern storage unit (Store Pattern) 148.

Although the apparatus is described here as consisting only of the check cache unit 112, the first page tracking unit 114, the content detection unit 120, the content revision unit 122, the content storage unit 130, the result transmission unit 132, the feed registration unit 140, the item extraction unit 142, the second page tracking unit 144, the pattern analysis unit 146, and the pattern storage unit 148, this merely illustrates the technical idea of the invention. A person of ordinary skill in the art could modify and vary the components included in the apparatus in many ways without departing from the essential characteristics of the invention.

The main content detection unit (Main Content Detection) 110 receives a content extraction request.

That is, the main content detection unit 110 receives the content extraction request from the core that implements RSS.

The check cache unit 112 checks whether a previously extracted page exists for the requested page and, if one exists, reads that previously extracted page and transmits it.
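The cache check described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the `ExtractionCache` class, the `handle_request` function, and the use of an in-memory dictionary keyed by URL are all assumptions made for the example.

```python
from typing import Callable, Dict, Optional


class ExtractionCache:
    """Hypothetical in-memory store of previously extracted pages, keyed by URL."""

    def __init__(self) -> None:
        self._pages: Dict[str, str] = {}

    def get(self, url: str) -> Optional[str]:
        return self._pages.get(url)

    def put(self, url: str, content: str) -> None:
        self._pages[url] = content


def handle_request(url: str, cache: ExtractionCache,
                   extract: Callable[[str], str]) -> str:
    """Return the cached extraction if one exists; otherwise extract and store it."""
    cached = cache.get(url)
    if cached is not None:
        return cached            # previously extracted page: read it and send it
    content = extract(url)       # full pipeline: track page, detect, revise
    cache.put(url, content)
    return content
```

The `extract` callable stands in for the rest of the pipeline (page tracking, content detection, and revision), so a second request for the same URL never reaches it.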

When the URL page of item information extracted from a preset feed is not the collection target, the first page tracking unit 114 according to an embodiment of the present invention constructs the page using a preset program or command and uses the clue information of the item information to track, among the constructed pages, only a page that contains the clue information.

Here, the preset program or command may be at least one of a redirect (Redirect), a frame (Frame), an iframe (Iframe), and JavaScript.

The clue information may include at least one of the title and the body summary information.

The first page tracking unit 114 according to an embodiment of the present invention splits the title and body summary information on whitespace and special characters to generate a list of words to check, and collects the page at the URL of the item information from the corresponding website.

The first page tracking unit 114 according to an embodiment of the present invention checks whether the body text of the collected page contains the title and body summary information. If it does, tracking is complete for that page. If it does not, the unit analyzes the page, extracts a URL using the preset program or command, and collects the page at the extracted URL.

To check whether the body text of the collected page contains the title and body summary information, the first page tracking unit 114 according to an embodiment of the present invention checks whether the text inside the body (Body) tag of the collected page contains the words of the clue information.

The first page tracking unit 114 according to an embodiment of the present invention also tracks a page that contains only some of the words in the clue information.

When the clue information reveals a portion of the tracked page that contains the title, the content detection unit 120 according to an embodiment of the present invention extracts the lowest-level area that matches the title in the clue information and checks whether the extracted area is a list (Link List). If the extracted area is in list form, it is excluded from the main content area. The unit then extracts the main content (Main Content) area by expanding the extracted area to upper nodes (Node) while checking whether each area contains links (Link) and has a repetitive pattern (Pattern).

When several portions containing the title are found, the content detection unit 120 according to an embodiment of the present invention likewise extracts the lowest-level area that matches the title in the clue information and checks whether the extracted area is a list (Link List). If the extracted area is in list form, it is excluded from the main content area, and the unit expands the extracted area to upper nodes (Node) while checking whether each area contains links (Link) and has a repetitive pattern (Pattern).

Here, the lowest-level area means the last node of the DOM tree.

The content detection unit 120 according to an embodiment of the present invention assigns priorities according to the tag (Tag) characteristics of the surrounding area, inspects the node containing the title while expanding it to upper nodes, and checks whether the body summary information of the clue information is present in a child node or neighboring node of the expanded upper node that does not contain the title.

Here, an upper node means the parent node (Parent Node) in the DOM tree.
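The search described above (find the lowest node matching the title, reject link lists, and widen to parent nodes until the body summary appears) can be sketched as below. This is only an illustration under stated assumptions: it parses well-formed markup with the standard `xml.etree.ElementTree` module instead of a tolerant HTML parser, it interprets "lowest-level area" as a node none of whose children contain the title, and the `is_link_list` heuristic with its `min_items` threshold is hypothetical rather than taken from the patent.

```python
import xml.etree.ElementTree as ET


def text_of(node):
    """All text under a node, with whitespace runs collapsed."""
    return " ".join("".join(node.itertext()).split())


def lowest_title_nodes(root, title):
    """Nodes whose text contains the title while none of their children do."""
    return [node for node in root.iter()
            if title in text_of(node)
            and not any(title in text_of(child) for child in node)]


def is_link_list(node, min_items=3):
    """Assumed heuristic: several same-tag children that each carry a link."""
    children = list(node)
    if len(children) < min_items:
        return False
    same_tag = len({child.tag for child in children}) == 1
    all_links = all(child.tag == "a" or child.find(".//a") is not None
                    for child in children)
    return same_tag and all_links


def find_content_area(root, title, summary):
    """Widen each title node to its parents until the summary text is covered."""
    parents = {child: parent for parent in root.iter() for child in parent}
    for node in lowest_title_nodes(root, title):
        area = node
        while area is not None:
            if is_link_list(area):
                break                      # title sits in a link list: skip it
            if summary in text_of(area):
                return area                # title and summary share this area
            area = parents.get(area)       # expand to the parent node
    return None
```

On a page where the title appears both in a navigation list and in an article heading, the navigation occurrence is abandoned as soon as the expansion reaches the repeated list, and the article container is returned instead.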

The priority may follow the order Emphasis > Normal > Link > Hidden.

If the content area cannot be extracted from the selected title node, the content detection unit 120 according to an embodiment of the present invention performs the check on the title node with the next priority.
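The priority order Emphasis > Normal > Link > Hidden can be applied by sorting the candidate title nodes before trying them one after another. The sketch below is an assumption-laden illustration: the source does not say which tags belong to which class, so the `EMPHASIS_TAGS` set, the `classify` function, and the test for hidden elements are hypothetical.

```python
# Lower number = checked earlier (Emphasis > Normal > Link > Hidden).
PRIORITY = {"emphasis": 0, "normal": 1, "link": 2, "hidden": 3}

# Assumed mapping of tags to the "Emphasis" class; not specified in the source.
EMPHASIS_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6", "b", "strong", "em"}


def classify(tag, attrs):
    """Assign a title node to one of the four priority classes."""
    style = attrs.get("style", "").replace(" ", "").lower()
    if "hidden" in attrs or "display:none" in style:
        return "hidden"
    if tag == "a":
        return "link"
    if tag in EMPHASIS_TAGS:
        return "emphasis"
    return "normal"


def order_candidates(candidates):
    """Sort (tag, attrs) pairs so higher-priority title nodes are tried first."""
    return sorted(candidates, key=lambda c: PRIORITY[classify(c[0], c[1])])
```

A caller would walk the returned list in order and move on to the next candidate whenever no content area can be extracted from the current one.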

If the content area cannot be extracted, the content detection unit 120 according to an embodiment of the present invention estimates the content area from the information analyzed when the feed was registered and, if the DOM tree structure matches, selects that area as the content area.

If the DOM tree structure does not match, the content detection unit 120 according to an embodiment of the present invention selects an area with a large amount of text or a large image as the content area, and generates a layout (Layout) from which non-content areas such as comments and advertisements have been removed.

The content detection unit 120 according to an embodiment of the present invention corrects the parts that deviate from the HTML (HyperText Markup Language) specification and parses the HTML text to generate a DOM tree (DOM Tree).

The content revision unit 122 reads the feed rule (Feed Rule) file to load the existing analysis information for the feed, compares it with the extracted main content area or the clue information, and, if anything is missing or incorrect, readjusts the main content area using the existing analysis information.

The content storage unit 130 stores the extracted or readjusted main content as a file.

The result transmission unit 132 transmits the extracted or readjusted main content to the client.

The feed registration unit 140 receives a feed from a feed input means.

The item extraction unit 142 extracts item information from the received feed: it collects the RSS feed page from the page at the received URL and extracts, from the collected RSS feed page, at least one of the URL, the title, and the body summary information of each item.

Here, the feed may support at least one of the ATOM, RSS 1.0, and RSS 2.0 specifications.
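As a rough illustration of item extraction, the sketch below pulls the URL, title, and summary out of an RSS 2.0 or Atom document with the standard `xml.etree.ElementTree` module. The function name `extract_items` and the dictionary keys are assumptions for the example, and RSS 1.0 (RDF) handling is left out for brevity.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"  # Atom 1.0 XML namespace


def _text(node, tag):
    """Text of the first matching child, or an empty string if absent."""
    return (node.findtext(tag) or "").strip()


def extract_items(feed_xml):
    """Return a list of {'url', 'title', 'summary'} dicts for each feed item."""
    root = ET.fromstring(feed_xml)
    items = []
    if root.tag == "rss":                       # RSS 2.0
        for item in root.iter("item"):
            items.append({
                "url": _text(item, "link"),
                "title": _text(item, "title"),
                "summary": _text(item, "description"),
            })
    elif root.tag == ATOM + "feed":             # Atom
        for entry in root.iter(ATOM + "entry"):
            link = entry.find(ATOM + "link")
            items.append({
                "url": link.get("href", "") if link is not None else "",
                "title": _text(entry, ATOM + "title"),
                "summary": _text(entry, ATOM + "summary"),
            })
    return items
```

The title and summary returned here play the role of the clue information used later by the page tracking and content detection units.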

When the page at the URL is not the collection target, the second page tracking unit 144 uses the clue information of the item information to track the page that contains the clue information.

The pattern analysis unit 146 tracks the main content for each of the plural items extracted from the feed, analyzes the pattern (Pattern) of the extracted segments (Segment), and gives priority to extracting a main content area whose title matches and which contains the body text.

The pattern storage unit 148 stores the analyzed pattern as a file. The pattern stored by the pattern storage unit 148 is saved as a feed rule (Feed Rule).
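One way to picture the pattern analysis and the resulting feed rule is shown below. The source does not specify the feed rule format, so everything here is hypothetical: the rule is a small JSON object, the "pattern" is simply the DOM path that most often held the main content across the sampled items, and the names `build_feed_rule`, `dump_feed_rule`, `content_path`, and `support` are invented for the sketch.

```python
import json
from collections import Counter


def build_feed_rule(feed_url, content_paths):
    """Pick the DOM path that most often held the main content across items."""
    path, count = Counter(content_paths).most_common(1)[0]
    return {
        "feed": feed_url,
        "content_path": path,                    # hypothetical rule field
        "support": count / len(content_paths),   # share of items that agree
    }


def dump_feed_rule(rule):
    """Serialize the rule so it can be written to a feed rule file."""
    return json.dumps(rule, ensure_ascii=False)
```

A revision step could later load such a rule and compare its `content_path` with the area found for a new item, falling back to the stored path when the fresh extraction looks incomplete.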

FIG. 2 is a flowchart illustrating the function of the page tracking unit according to an embodiment of the present invention.

When the URL page of item information extracted from a preset feed is not the collection target, the first page tracking unit 114 constructs the page using a preset program or command and uses the clue information of the item information to track, among the constructed pages, only a page that contains the clue information.

Here, the preset program or command may be at least one of a redirect (Redirect), a frame (Frame), an iframe (Iframe), and JavaScript. The clue information may include at least one of the title and the body summary information.

The first page tracking unit 114 splits the title and body summary information on whitespace and special characters to generate a list of words to check (S210).

That is, because the title and body summary information provided by the feed frequently do not match the content of the web page exactly, the first page tracking unit 114 splits them on whitespace and special characters and generates a list of words to check.
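Step S210 can be sketched with a regular expression split. This is a minimal example under assumptions: the function name `build_word_list` is hypothetical, and lower-casing and de-duplicating the words are choices made here for more forgiving matching, not steps stated in the source.

```python
import re


def build_word_list(title, summary):
    """Split title and summary on whitespace/special characters (step S210)."""
    # [\W_]+ matches runs of anything that is not a letter or digit;
    # in Python 3 this is Unicode-aware, so Korean words are kept intact.
    raw_words = re.split(r"[\W_]+", f"{title} {summary}".lower())
    words = []
    for word in raw_words:
        if word and word not in words:   # drop empty strings and duplicates
            words.append(word)
    return words
```

For example, the title "Hello, World!" and the summary "hello-world: RSS 2.0" yield the list `["hello", "world", "rss", "2", "0"]`.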

The first page tracking unit 114 collects the page at the URL of the item information from the corresponding website (S220).

The first page tracking unit 114 performs the tracking check (S230).

In the present invention, the first page tracking unit 114 performs the tracking check in one of four ways.

First, the first page tracking unit 114 checks whether the body text of the collected page contains the title and the body summary information (S240).

If the check in step S240 finds that the title and body summary information are present, tracking is complete for the page that contains them.

If the check in step S240 finds that the title and body summary information are not present, the first page tracking unit 114 analyzes the page and extracts a URL using the preset program or command (S280).

Here, the first page tracking unit 114 can extract the URL from a Frame, an Iframe, or JavaScript.
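A hedged sketch of step S280 follows, using the standard `html.parser` module. It collects `src` attributes from `frame` and `iframe` tags and looks for simple `location = "..."` assignments in inline scripts. The class name `CandidateUrlParser`, the function `extract_candidate_urls`, and the `JS_REDIRECT` pattern are assumptions for the example; real pages use many other JavaScript navigation forms that this regular expression will not catch, and HTTP-level redirects would be handled by whatever client fetches the page.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin

# Assumed pattern: matches e.g.  location.href = "/path"  or  location = '/path'
JS_REDIRECT = re.compile(r"""location(?:\.href)?\s*=\s*["']([^"']+)["']""")


class CandidateUrlParser(HTMLParser):
    """Collects URLs that may lead to the real content page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.urls = []
        self._in_script = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("frame", "iframe") and attrs.get("src"):
            self.urls.append(urljoin(self.base_url, attrs["src"]))
        elif tag == "script":
            self._in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        if self._in_script:
            for match in JS_REDIRECT.finditer(data):
                self.urls.append(urljoin(self.base_url, match.group(1)))


def extract_candidate_urls(html, base_url):
    """Return absolute URLs found in frames, iframes, and simple JS redirects."""
    parser = CandidateUrlParser(base_url)
    parser.feed(html)
    parser.close()
    return parser.urls
```

Each returned URL would then be fetched in step S290 and run through the same clue check again.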

The first page tracking unit 114 collects the page at the extracted URL (S290).

Second, the first page tracking unit 114 checks whether the body text of the collected page contains the body summary information (S250).

If the check in step S250 finds that the body summary information is present, tracking is complete for the page that contains it.

한편, 단계 S250의 확인 결과, 본문 요약 정보가 포함되어 있지 않은 경우, 제 1 페이지 추적부(114)는 해당 페이지를 분석하여 기 설정된 프로그램 또는 명령어를 이용하여 URL 추출한다(S280).On the other hand, if it is determined in step S250 that the text summary information is not included, the first page tracking unit 114 analyzes the page and extracts a URL using a predetermined program or command (S280).

한편, 제 1 페이지 추적부(114)는 수집된 페이지의 본문 텍스트에 본문 요약 정보의 일부분이 포함되었는지의 여부 확인한다(S260).Meanwhile, the first page tracking unit 114 checks whether the body text of the collected page includes a part of the body summary information (S260).

단계 S260의 확인 결과, 본문 요약 정보의 일부분이 포함되어 있는 경우, 본문 요약 정보의 일부분을 포함한 페이지를 추적 완료한다.As a result of the check in step S260, if a part of the body summary information is included, the page including a part of the body summary information is traced.

한편, 단계 S260의 확인 결과, 본문 요약 정보의 일부분이 포함되어 있지 않은 경우, 제 1 페이지 추적부(114)는 해당 페이지를 분석하여 기 설정된 프로그램 또는 명령어를 이용하여 URL 추출한다(S280).On the other hand, if it is determined in step S260 that a part of the text summary information is not included, the first page tracking unit 114 analyzes the page and extracts the URL using a predetermined program or command (S280).

한편, 제 1 페이지 추적부(114)는 수집된 페이지의 본문 텍스트에 제목이 포함되었는지의 여부 확인한다(S270).On the other hand, the first page tracking unit 114 checks whether the title text is included in the body text of the collected page (S270).

단계 S270의 확인 결과, 제목이 포함되어 있는 경우, 제목을 포함한 페이지를 추적 완료한다.If it is determined in step S270 that the title is included, the page including the title is tracked.

한편, 단계 S270의 확인 결과, 제목이 포함되어 있지 않은 경우, 제 1 페이지 추적부(114)는 해당 페이지를 분석하여 기 설정된 프로그램 또는 명령어를 이용하여 URL 추출한다(S280).On the other hand, if it is determined in step S270 that the title is not included, the first page tracking unit 114 analyzes the page and extracts the URL using a predetermined program or command (S280).

한편, 제 1 페이지 추적부(114)는 단계 S240, 단계 S250, 단계 S260 및 단계 S270에서 수집된 페이지의 본문 텍스트에 제목 또는 본문 요약 정보가 포함되었는지의 여부 확인하기 위해 수집된 페이지의 바디(Body) 태그 영역 내의 텍스트에 클루 정보의 단어를 포함하고 있는지의 여부를 검사하는 방식을 이용할 수 있다.On the other hand, the first page tracking unit 114 checks whether the body text of the page collected in step S240, step S250, step S260, and step S270 includes body or body summary information of the collected page ) It is possible to use a method of checking whether the text in the tag area includes a word of clue information.
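As an illustration of the word-based inclusion check described above, the following Python sketch splits the clue information into words and tests them against the body text. The function names, the use of step labels as return values, the order in which the four checks are tried, and the `partial_ratio` threshold are assumptions made for the example; the specification does not fix them.

```python
import re


def words(text):
    """Split text into lowercase words at whitespace and special characters."""
    return [w for w in re.split(r"[\W_]+", text.lower()) if w]


def coverage(clue, body_words):
    """Fraction of the clue's words that occur in the body text."""
    clue_words = words(clue)
    if not clue_words:
        return 0.0
    hits = sum(1 for w in clue_words if w in body_words)
    return hits / len(clue_words)


def tracking_check(body_text, title, summary, partial_ratio=0.5):
    """Return the label of the first check that succeeds, or None.

    None means no check matched, so a new URL would be extracted (S280).
    """
    body_words = set(words(body_text))
    title_ok = coverage(title, body_words) == 1.0
    summary_cov = coverage(summary, body_words)
    if title_ok and summary_cov == 1.0:
        return "S240"  # title and full summary found
    if summary_cov == 1.0:
        return "S250"  # full summary found
    if summary_cov >= partial_ratio:
        return "S260"  # part of the summary found (threshold is assumed)
    if title_ok:
        return "S270"  # title found
    return None
```

Here `body_text` stands for the text already taken from the Body tag area of the collected page.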

FIG. 3 is a flowchart illustrating the function of the content detection unit according to an embodiment of the present invention.

When a portion containing the title of the tracked page is found in the tracked page using the clue information, the content detection unit 120 extracts the lowest region that matches the title of the clue information and checks whether the extracted region is a link list. If the extracted region is in list form, the content detection unit 120 excludes it from the main content area and, while expanding the extracted region to upper nodes, examines whether each is a region that contains links and has a repetitive pattern, thereby extracting the main content area.

In addition, the content detection unit 120 corrects the parts of the page that deviate from the HTML (HyperText Markup Language) specification and parses the HTML text to generate a DOM tree.
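A minimal sketch of such a correction step, written in Python with the standard `html.parser` module, is given below. It assumes a very simple repair policy (unclosed elements are closed when one of their ancestors is closed, and stray end tags are ignored); the class `TolerantTreeBuilder`, the helper functions, and the dictionary-based node layout are hypothetical and are not taken from the specification.

```python
from html.parser import HTMLParser

# Elements that never have children or an end tag (partial list for the sketch).
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class TolerantTreeBuilder(HTMLParser):
    """Builds a simple tree from HTML that may contain tag errors."""

    def __init__(self):
        super().__init__()
        self.root = {"tag": "#document", "children": []}
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = {"tag": tag, "children": []}
        self.stack[-1]["children"].append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        open_tags = [n["tag"] for n in self.stack[1:]]
        if tag not in open_tags:
            return  # stray end tag: ignore it
        # Close unclosed descendants together with the matching element.
        while self.stack[-1]["tag"] != tag:
            self.stack.pop()
        self.stack.pop()

    def handle_data(self, data):
        if data.strip():
            text = {"tag": "#text", "text": data.strip(), "children": []}
            self.stack[-1]["children"].append(text)


def build_tree(html):
    builder = TolerantTreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def outline(node):
    """Compact textual view of the tree, used here only for inspection."""
    inner = ",".join(outline(c) for c in node["children"])
    return node["tag"] + ("(" + inner + ")" if inner else "")
```

A production system would more likely rely on a full HTML5-conformant parser; the sketch only shows the idea of repairing the markup before the tree is examined.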

When multiple portions containing the title are found, the content detection unit 120 extracts the lowest region that matches the title of the clue information (S310).

The content detection unit 120 checks whether the extracted region is in the form of a link list (S320).

If the check in step S320 finds that the extracted region is in list form, the content detection unit 120 excludes the extracted region from the main content area (S322).

In addition, the content detection unit 120 expands the extracted region to upper nodes and examines whether each is a region that contains links and has a repetitive pattern.

If the check in step S320 finds that the extracted region is not in list form, the content detection unit 120 assigns priorities to the nodes containing the title according to the tag characteristics of the surrounding region (S330).

The content detection unit 120 checks whether any of the prioritized nodes remains to be examined (S332).

If the check in step S332 finds a node to examine, the content detection unit 120 performs the examination while expanding the node containing the title to upper nodes (parent nodes in the DOM tree) (S334).

The content detection unit 120 checks whether, under the expanded upper node, a lower (neighboring) node that does not contain the title contains the body summary information of the clue information (S336).

That is, the content detection unit 120 assigns priorities according to the tag characteristics of the surrounding region, examines the node containing the title while expanding it to upper nodes, and checks whether a lower node or neighboring node under the expanded upper node that does not contain the title holds the body summary information of the clue information. Here, an upper node means a parent node in the DOM tree.
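The upward expansion of steps S334 and S336 might be sketched in Python as follows. The `Node` class and `find_summary_region` are hypothetical names, a plain substring test stands in for the summary check, and returning the neighboring node that holds the summary is an assumption about what is treated as the candidate region.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class Node:
    """Minimal DOM-like node for the sketch (not the specification's structure)."""
    tag: str
    text: str = ""
    children: List["Node"] = field(default_factory=list)
    parent: Optional["Node"] = field(default=None, repr=False)

    def add(self, child):
        child.parent = self
        self.children.append(child)
        return child


def subtree_text(node):
    """Concatenate the text of a node and all of its descendants."""
    parts = [node.text] + [subtree_text(c) for c in node.children]
    return " ".join(p for p in parts if p)


def find_summary_region(title_node, summary):
    """Climb from the title node; at each parent, test the other children."""
    current = title_node
    while current.parent is not None:
        parent = current.parent
        for sibling in parent.children:
            if sibling is current:
                continue  # this branch contains the title, so skip it
            if summary in subtree_text(sibling):
                return sibling
        current = parent  # expand to the upper (parent) node
    return None
```

If the function returns `None` for a title node, the examination would move on to the title node with the next priority.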

If the check in step S332 finds no node to examine, the content detection unit 120 checks whether a region is extracted by the feed rule (S340).

If the check in step S340 finds that no region is extracted by the feed rule, the content detection unit 120 estimates the content area from the information analyzed when the feed was registered (S344).

When the DOM tree structure matches, the content detection unit 120 selects that region as the content area and performs step S342.

If the content area cannot be extracted from the selected title node, the content detection unit 120 proceeds to examine the title node with the next priority.

If the check in step S336 finds that a lower (neighboring) node contains the body summary information of the clue information, or if the check in step S340 finds that a region is extracted by the feed rule, the content detection unit 120 extracts the content area (S342).

When the DOM tree structure does not match, the content detection unit 120 selects a region that contains a large amount of text or a large image as the content area.

The content detection unit 120 removes regions other than the content, such as comments and advertisements (S350).

The content detection unit 120 generates a content layout (S352).

That is, the content detection unit 120 generates a layout from which the non-content regions (comments, advertisements) have been removed.

FIG. 4 is an exemplary diagram illustrating the examination technique according to an embodiment of the present invention.

As shown in FIG. 4, the upper (parent) node of td 410, which is contained in a table in the HTML body, is tr 420. A region can be expanded upward from #text 430 in the same way.

The first region 440 is the region that does not contain the "#text" node when the "#text" node at #text 430 is expanded upward.

When the title region is expanded from #text 430 and the regions that do not contain the title are examined for the body summary information, the examination checks whether the body summary information exists in the first region 440.

The pattern of the second region 450 can be written as "div.a.#text.div.#text", and the same pattern as that of the second region 450 is also found in the neighboring regions.

The "div" tag of the third region 460 contains four patterns identical to that of the second region 450, and because the pattern includes an "a" tag, the third region can be identified as a region containing a link list.
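The pattern comparison described for FIG. 4 can be sketched in Python as follows. Nodes are modeled as `(tag, children)` tuples, and the function names and the `min_repeats` threshold are assumptions for illustration; the specification only states that a repeated pattern containing an "a" tag indicates a link list.

```python
from collections import Counter


def pattern(node):
    """Dotted tag sequence of a subtree, for example 'div.a.#text.div.#text'."""
    tag, children = node
    return ".".join([tag] + [pattern(child) for child in children])


def is_link_list(node, min_repeats=2):
    """True if the node's children repeat one pattern that includes an 'a' tag."""
    _, children = node
    counts = Counter(pattern(child) for child in children)
    for pat, n in counts.items():
        if n >= min_repeats and "a" in pat.split("."):
            return True
    return False


# One list entry shaped like the second region 450 of FIG. 4.
entry = ("div", [("a", [("#text", [])]), ("div", [("#text", [])])])
# A container holding four such entries, like the third region 460.
container = ("div", [entry, entry, entry, entry])
```

With these definitions, `pattern(entry)` yields the string quoted in the description, and `is_link_list(container)` reports the container as a link list.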

Although all the components constituting the embodiments of the present invention have been described above as being combined into one or operating in combination, the present invention is not necessarily limited to such embodiments. That is, within the scope of the object of the present invention, one or more of the components may be selectively combined to operate. In addition, although each of the components may be implemented as a separate piece of hardware, some or all of the components may be selectively combined and implemented as a computer program having program modules that perform some or all of the combined functions on one or more pieces of hardware. The codes and code segments constituting the computer program can be readily inferred by those skilled in the art. Such a computer program can be stored in computer-readable media and read and executed by a computer, thereby implementing the embodiments of the present invention. Storage media for the computer program may include magnetic recording media, optical recording media, and carrier wave media.

In addition, terms such as "include", "comprise", or "have" used above mean, unless specifically stated otherwise, that the corresponding component may be present, and should therefore be construed as allowing other components to be further included rather than excluding them. All terms, including technical and scientific terms, have the same meaning as commonly understood by a person of ordinary skill in the art to which the present invention belongs, unless otherwise defined. Commonly used terms, such as those defined in dictionaries, should be interpreted as consistent with their meaning in the context of the related art and, unless expressly defined in the present invention, are not to be interpreted in an idealized or excessively formal sense.

The above description merely illustrates the technical idea of the present invention, and a person of ordinary skill in the art to which the present invention belongs can make various modifications and variations without departing from its essential characteristics. Accordingly, the embodiments disclosed in the present invention are intended to explain, not to limit, the technical idea of the present invention, and the scope of that technical idea is not limited by these embodiments. The scope of protection of the present invention should be construed according to the claims below, and all technical ideas within a scope equivalent to them should be construed as falling within the scope of rights of the present invention.

As described above, the present invention applies to the field of methods, apparatuses, and computer-readable recording media for extracting content in a web page, in which only what the user perceives as the actual content is extracted from a target web page on the basis of the summary information for a specific page in RSS. For the various content pages provided in RSS form, it uses the RSS summary information to extract and provide only the body page the user wants.

FIG. 1 is a block diagram schematically showing an apparatus for extracting content in a web page according to an embodiment of the present invention.

FIG. 2 is a flowchart illustrating the function of the page tracking unit according to an embodiment of the present invention.

FIG. 3 is a flowchart illustrating the function of the content detection unit according to an embodiment of the present invention.

< Description of reference numerals for the main parts of the drawings >

110: main content detection unit 112: check cache unit

114: first page tracking unit 120: content detection unit

122: content revision unit 130: content storage unit

132: result transmission unit 140: feed registration unit

142: item extraction unit 144: second page tracking unit

146: pattern analysis unit 148: pattern storage unit

Claims

An apparatus for extracting content in a web page, comprising:

a first page tracking unit (Chase Page) that, when the URL page of item information extracted from a preset feed is not a collection target, configures a page using a preset program or command and, in the configured page, uses clue information of the item information to track only pages that contain the clue information;

a content detection unit (Content Detection) that, upon receiving a content extraction request from a core implementing RSS, when a portion containing the title of the tracked page is found in the tracked page using the clue information, extracts the lowest region matching the title of the clue information, checks whether the extracted region is a link list, excludes the extracted region from the main content area if it is in list form, and extracts a main content area by examining, while expanding the extracted region to upper nodes, whether each is a region that contains links and has a repetitive pattern;

a content revision unit (Content Revision) that reads a feed rule file to load the existing analysis information of the feed, compares it with the extracted main content area or the clue information, and, if anything is missing or wrong, readjusts the main content area using the existing analysis information;

a content storage unit (Store Content) that stores the extracted or readjusted main content as a file; and

a result transmission unit (Send Result) that transmits the extracted main content to a client,

wherein the content detection unit assigns priorities to the nodes containing the title, examines the node containing the title while expanding it to upper nodes, and checks whether a lower node or neighboring node under the expanded upper node that does not contain the title holds the body summary information of the clue information.

delete

The apparatus according to claim 1, wherein the lowest region is the last node of the DOM tree.

delete

The apparatus according to claim 1, wherein the upper node is a parent node in the DOM tree.

The apparatus according to claim 1, wherein the priority order is Emphasis > Normal > Link > Hidden.

The apparatus according to claim 1, wherein, if the content area cannot be extracted from the selected title node, the content detection unit proceeds to examine the title node with the next priority.

8. The apparatus of claim 7, wherein, when the content area is not extracted, the content detection unit estimates the content area from the information analyzed when the feed was registered, and selects the region as the content area when the DOM tree structure matches.

9. The apparatus of claim 8, wherein, when the DOM tree structure does not match, the content detection unit selects a region in which an image exists and the text region closest to the image as the content area, and generates a layout from which comment regions or advertisement regions other than the content are removed.

The apparatus according to claim 1, wherein the first page tracking unit splits the title and the body summary information at blank characters and special characters into the words to be examined, and collects the page corresponding to the URL of the item information from the relevant website.

11. The apparatus of claim 10, wherein the first page tracking unit checks whether the body text of the collected page contains the title and the body summary information; if they are included, completes tracking with the page that contains the title and the body summary information; and if they are not included, analyzes the page, extracts a URL using a preset program or command, and collects a page using the extracted URL.

12. The apparatus of claim 11, wherein, to check whether the body text of the collected page contains the title and the body summary information, the first page tracking unit examines whether the text within the Body tag area of the collected page contains the words of the clue information.

The apparatus according to claim 1, wherein the first page tracking unit tracks a page that contains some of the words corresponding to the clue information.

The apparatus according to claim 1, wherein the preset program is written in JavaScript and the command redirects the web page to a Frame or an Iframe.

The apparatus according to claim 1, wherein the content detection unit corrects parts that deviate from the HTML (HyperText Markup Language) specification and generates a DOM tree by parsing the HTML text.

The apparatus according to claim 1, further comprising:

a feed registration unit that receives a feed from a feed input means;

an item extraction unit (Extract Item) that extracts the item information from the received feed, collects the RSS feed page from the page corresponding to the received URL, and extracts at least one of the URL, title, and body summary information of the item information from the collected RSS feed page;

a second page tracking unit that, when the page corresponding to the URL is not a collection target, tracks the page containing the clue information using the clue information of the item information;

a pattern analysis unit (Pattern Analysis) that tracks the main content for each item of the plurality of item information extracted from the feed, extracts the main content area containing the body summary information, and analyzes the pattern of the extracted area; and

a pattern storage unit (Store Pattern) that stores the analyzed pattern as a file.

17. The apparatus of claim 16, wherein the feed supports at least one of the ATOM, RSS 1.0, and RSS 2.0 standards.

The apparatus according to claim 1, further comprising a check cache unit that checks whether a previously extracted page exists for a requested page and, if one is found, reads the previously extracted page.

The apparatus according to claim 1, wherein the clue information includes a title and body summary information.

A method for extracting content in a web page, comprising:

(a) configuring a page using a preset program or command when the URL page of item information extracted from a preset feed is not a collection target;

(b) tracking, in the page configured by the page tracking unit, only pages that contain clue information of the item information, using the clue information;

(c) upon a content detection unit receiving a content extraction request from a core implementing RSS, when a portion containing the title of the tracked page is found in the tracked page using the clue information, extracting the lowest region matching the title of the clue information, checking whether the extracted region is a link list, excluding the extracted region from the main content area if it is in list form, and extracting a main content area by examining, while expanding the extracted region to upper nodes, whether each is a region that contains links and has a repetitive pattern;

(d) reading, by a content revision unit, a feed rule file to load the existing analysis information of the feed, comparing it with the extracted main content area or the clue information, and, if anything is missing or wrong, readjusting the main content area using the existing analysis information;

(e) storing, by a content storage unit, the extracted or readjusted main content as a file; and

(f) transmitting, by a result transmission unit, the extracted or readjusted main content to a client,

wherein the content detection unit assigns priorities to the nodes containing the title, examines the node containing the title while expanding it to upper nodes, and checks whether a lower node or neighboring node under the expanded upper node that does not contain the title holds the body summary information of the clue information.

A computer-readable recording medium on which is recorded a program for running an apparatus for extracting content in a web page, the program implementing:

a function of configuring a page using a preset program or command when the URL page of item information extracted from a preset feed is not a collection target;

a function of tracking, in the page configured by the page tracking unit, only pages that contain clue information of the item information, using the clue information;

a function of, upon a content detection unit receiving a content extraction request from a core implementing RSS, when a portion containing the title of the tracked page is found in the tracked page using the clue information, extracting the lowest region matching the title, checking whether the extracted region is a link list, excluding the extracted region from the main content area if it is in list form, and extracting a main content area by examining, while expanding the extracted region to upper nodes, whether each is a region that contains links and has a repetitive pattern;

a function of reading, by a content revision unit, a feed rule file to load the existing analysis information of the feed, comparing it with the extracted main content area or the clue information, and, if anything is missing or wrong, readjusting the main content area using the existing analysis information;

a function of storing, by a content storage unit, the extracted or readjusted main content as a file; and

a function of transmitting the main content to a client,

wherein the content detection unit assigns priorities to the nodes containing the title, examines the node containing the title while expanding it to upper nodes, and checks whether a lower node or neighboring node under the expanded upper node that does not contain the title holds the body summary information of the clue information.