KR100284578B1 - News information collection system and collection method supporting mobile computing

Info

Publication number: KR100284578B1
Application number: KR1019980054167A
Authority: KR
Inventors: 강대기; 이경호; 함호상
Original assignee: 정선종; 한국전자통신연구원
Priority date: 1998-12-10
Filing date: 1998-12-10
Publication date: 2001-03-15
Also published as: KR20000038979A

Abstract

The present invention relates to a news information collection system and collection method that collect the sections of online magazines and newspapers provided as HTML (Hypertext Markup Language) documents on the web, extract newspaper article records through a predetermined process, and provide them to the user. A news information collection system according to one aspect of the invention comprises: an HTTP communication unit (10) that collects, by HTTP (Hypertext Transfer Protocol) communication, each section of a newspaper provided as an HTML document from an online newspaper or magazine site on the web; a noise removing unit (20) that removes noise present in the newspaper article document collected by the HTTP communication unit (10); a pattern input unit (40) through which a developer or user inputs a desired regular expression pattern; and a pattern matching unit (30) that converts the web document from which noise has been removed by the noise removing unit (20) into a regular expression string, and matches the converted regular expression string against the regular expression pattern input through the pattern input unit (40) to extract newspaper article records. According to the invention, each section of an online magazine or newspaper is collected in real time and passed through the noise removal and pattern matching processes, so that the system is more robust to web site changes than conventional methods, while the developer can extract the desired article information in real time simply by indicating, in a simple and easy way, the target from which article information is to be extracted.

Description

News information collection system and collection method supporting mobile computing

The present invention relates to a news information collection system and collection method supporting mobile computing, and more particularly to a news information collection system and collection method that collect the sections of online magazines and newspapers provided as HTML (Hypertext Markup Language) documents on the web, extract newspaper article records through a predetermined process, and provide them to the user.

Recently, as the World Wide Web has become widespread with the arrival of the information society, press and broadcast media such as newspapers, television, and magazines have been moving onto the Internet in large numbers. These sites update their content at regular intervals and, building on the reputation they already enjoy in other media, continue to increase their influence on the web.

Conventional techniques related to the present invention, which collect and process HTML-format information from online web sites and provide it to the user, fall broadly into two categories.

The first is the push solution, which sends newspaper article information to the user based on the user's preferred newspaper categories. Push solutions let the user view information, particularly from press and broadcast media, much like watching a TV channel, without having to search for it directly in a web browser. With such a solution, the user first selects preferred sections from the newspaper's internal categories, such as "Politics", "Economy", and "Society". The server then extracts the article information for those sections from the newspaper site and sends it while the user's computer terminal is idle. A good example of idle time is when the user's screen saver is running, so many push solutions take the form of a screen saver application. This was the early form of the push solution; it later came to be used for a variety of applications beyond personalized newspapers.

The second is the shopping agent, which incorporates product information retrieval technology. Techniques related to shopping agents are disclosed in the following literature.

Kim Hyun-don et al., "Development of a Prototype Shopping Agent with a Comparison-Shopping Function," Proceedings of the Korea Information Science Society, 24(2), 289-292, 1997, discloses a method that analyzes the HTML documents of online stores to search for the products requested by the user and adds a function for comparing and analyzing the search results, thereby enabling online comparison shopping on the web.

In addition, R. B. Doorenbos et al., "A Scalable Comparison-Shopping Agent for the World-Wide Web", Proceedings of the 1st International Conference on Autonomous Agents, 1(1), 39-48, 1997, describes a shopping agent composed of a learner, which automatically generates the data for a general-purpose interface module that sends queries to an online store's searchable index and interprets the results, and a shopper, which uses that data to interpret user queries and send them to the searchable indexes of online stores.

However, the push solutions described above either must be supplied with article information through a predetermined protocol under a contract with the news or magazine site, or extract article information from the newspaper site using ad hoc, hand-written code. Extraction by ad hoc code has the critical drawback that the code already written becomes useless whenever the newspaper site changes.

Shopping agents for product information retrieval, for their part, learn an online store and generate a regular expression pattern representing the desired product information. Because this pattern is derived directly from web documents from which noise has not been removed, the agent must be retrained whenever the store changes even slightly.

Accordingly, an object of the present invention is to provide a news information collection system and collection method that, for newspaper or magazine sites that change periodically in this way, are more robust to web site changes than conventional methods and allow the desired article information to be extracted in real time, with the developer simply indicating in a simple and easy way the target from which article information is to be extracted.

FIG. 1 is a block diagram of a news information collection system supporting mobile computing according to the present invention;

FIG. 2 is an overall flowchart of the news information collection process supporting mobile computing according to the present invention;

FIG. 3 is a detailed flowchart of the noise removal process within the news information collection process shown in FIG. 2; and

FIG. 4 is a detailed flowchart of the pattern matching process within the news information collection process shown in FIG. 2.

* Explanation of reference numerals for the main parts of the drawings *

10: HTTP communication unit
20: noise removing unit
30: pattern matching unit
40: pattern input unit
50: PDA installation unit

To achieve the above object, a news information collection system according to one aspect of the present invention comprises:

an HTTP communication unit that collects, by HTTP (Hypertext Transfer Protocol) communication, each section of a newspaper provided as an HTML document from an online newspaper or magazine site on the web;

a noise removing unit that removes noise present in the newspaper article document collected by the HTTP communication unit;

a pattern input unit through which a developer or user inputs a desired regular expression pattern; and

a pattern matching unit that converts the web document from which noise has been removed by the noise removing unit into a regular expression string, and matches the converted regular expression string against the regular expression pattern input by the developer or user through the pattern input unit to extract newspaper article records.

A news information collection system according to another aspect of the present invention further comprises a PDA installation unit that presents the extracted newspaper articles to the user and automatically installs the articles selected by the user on a PDA (Personal Digital Assistant).

Here, the noise removing unit preferably comprises an anchor tag conversion module, which converts the anchor tags enclosing each of the records that make up the list in the newspaper article document collected by the HTTP communication unit into a form that the pattern matching unit can process easily, and a tag removal module, which deletes unnecessary tags from the newspaper article document whose anchor tags have been converted.

The pattern matching unit preferably comprises a regular expression string generator, which converts the web document from which noise has been removed by the noise removing unit into a regular expression string, and a regular expression string matcher, which matches the regular expression string produced by the generator against the regular expression pattern input by the developer or user through the pattern input unit and extracts the portion of the web document corresponding to each successfully matched substring as one record of the article list.

A news information collection method according to one aspect of the present invention comprises the steps of:

collecting, by HTTP communication, each section of a newspaper provided as an HTML document from an online newspaper or magazine site on the web;

removing noise present in the collected newspaper article document;

inputting a regular expression pattern desired by a developer or user; and

converting the web document from which noise has been removed into a regular expression string, and matching the converted regular expression string against the input regular expression pattern to extract newspaper article records.

A news information collection method according to another aspect of the present invention further comprises the step of presenting the extracted newspaper articles to the user and automatically installing the articles selected by the user on a PDA.

The present inventors noted that each section of an online magazine or newspaper provided as an HTML document on the web serves as a list of articles. They found that a news information collection system and collection method capable of extracting the desired article information can be provided by combining a noise removal technique, which makes extraction robust to small changes in parts of the target site unrelated to its actual information, with a pattern matching technique, which allows the desired information to be extracted from a simple instruction by the developer. After intensive research, they completed the present invention.

Preferred embodiments of the news information collection system and collection method supporting mobile computing according to the present invention are described below in more detail with reference to the accompanying drawings.

FIG. 1 is a block diagram of a news information collection system supporting mobile computing according to the present invention. As shown in FIG. 1, the news information collection system according to a preferred embodiment comprises: an HTTP communication unit (10) that collects, by HTTP communication, each section of a newspaper provided as an HTML document from an online newspaper or magazine site on the web; a noise removing unit (20) that removes noise present in the newspaper article document collected by the HTTP communication unit (10); a pattern input unit (40) through which a developer or user inputs a desired regular expression pattern; and a pattern matching unit (30) that converts the web document from which noise has been removed by the noise removing unit (20) into a regular expression string, and matches the converted regular expression string against the regular expression pattern input through the pattern input unit (40) to extract newspaper article records. The system additionally comprises a PDA installation unit (50) that presents the extracted newspaper articles to the user and automatically installs the articles selected by the user on a PDA (60).

The noise removing unit (20) comprises an anchor tag conversion module (21), which converts the anchor tags enclosing each of the records that make up the list in the newspaper article document collected by the HTTP communication unit (10) into a form that the pattern matching unit (30) can process easily, and a tag removal module (22), which deletes unnecessary tags from the newspaper article document whose anchor tags have been converted.

The pattern matching unit (30) consists of a regular expression string generator (31), which converts the web document from which noise has been removed by the noise removing unit (20) into a regular expression string, and a regular expression string matcher (32), which matches the regular expression string produced by the generator (31) against the regular expression pattern input by the developer or user through the pattern input unit (40) and extracts the portion of the web document corresponding to each successfully matched substring as one record of the article list.

FIG. 2 is an overall flowchart of the news information collection process supporting mobile computing according to the present invention. In the news information collection method, each section (politics, economy, and so on) of a newspaper provided as an HTML document is first collected by HTTP communication from an online newspaper or magazine site on the web, yielding a newspaper article document (M1) (S10). Next, the noise present in the collected newspaper article document (M1) is removed to obtain a noise-free web document (M2) (S20). The noise-free web document (M2) is then converted into a regular expression string, and the converted string is matched (S30) against the regular expression pattern (M3) input by the developer or user (S40). The newspaper article records (M4) matched by the pattern are extracted, and news information is thereby collected. Finally, the newspaper articles (M4) extracted through this process are presented to the user, and the articles selected by the user are automatically installed on the PDA (60) (S50).
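The data flow from M1 to M4 can be sketched as a small Python pipeline. This is only an illustrative outline, not the patented implementation: the function names `fetch_section` and `collect_news`, the `encoding` parameter, and the idea of passing the noise removal and matching steps in as callables are all assumptions introduced here. Steps S40 (pattern input) and S50 (PDA installation) are left out.

```python
from typing import Callable, List
from urllib.request import urlopen


def fetch_section(url: str, encoding: str = "utf-8") -> str:
    """S10: download one newspaper section page over HTTP (document M1)."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode(encoding, errors="replace")


def collect_news(
    url: str,
    pattern: str,
    remove_noise: Callable[[str], str],
    match_records: Callable[[str, str], List[str]],
    fetch: Callable[[str], str] = fetch_section,
) -> List[str]:
    """Run S10 -> S20 -> S30 and return the extracted records (M4)."""
    m1 = fetch(url)                    # S10: collected section document (M1)
    m2 = remove_noise(m1)              # S20: noise-free web document (M2)
    return match_records(m2, pattern)  # S30: records matched by pattern M3
```

Keeping each stage behind a plain callable means a stage can be replaced, or the HTTP fetch stubbed out for testing, without touching the others.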

FIG. 3 is a detailed flowchart of the noise removal process within the news information collection process shown in FIG. 2. The newspaper article document (M1) obtained above is in effect a list of articles, and each record in this list is enclosed in an anchor tag. The anchor tag is therefore the most important tag in the document. The anchor tags are converted into a form that is easy to handle in the pattern matching process described below, yielding a newspaper article document with converted anchor tags (M5) (S21). Unnecessary tags are then deleted from this document (M5) to obtain the noise-free web document (M2) (S22).

FIG. 4 is a detailed flowchart of the pattern matching process within the news information collection process shown in FIG. 2. As shown in FIG. 4, the web document (M2) from which noise has been removed by the noise removal process is converted into a regular expression string (S31). The converted regular expression string is then matched against the regular expression pattern (M3) specified by the developer or user (S32). When a successfully matched substring exists, the portion of the noise-free web document (M2) corresponding to that substring is one record of the article list and corresponds to the headline of the article.
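The patent does not define how the regular expression string is encoded. One possible reading, assumed in the sketch below, is that every token of the noise-free document is mapped to a single symbol, so that match offsets in the symbol string index directly into the token list. The `[[A url]]` and `[[/A]]` anchor markers, the symbols `A`, `T`, and `E`, and all function names are hypothetical.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical token syntax: "[[A url]]" opens an anchor, "[[/A]]" closes it,
# and anything else is text. Neither markers nor symbols come from the patent.
TOKEN = re.compile(r"\[\[A ([^\]]+)\]\]|\[\[/A\]\]|[^\[]+")


def tokenize(m2: str) -> List[Tuple[str, str]]:
    """Split the noise-free document (M2) into (symbol, payload) tokens."""
    tokens = []
    for m in TOKEN.finditer(m2):
        piece = m.group(0)
        if piece.startswith("[[A "):
            tokens.append(("A", m.group(1)))     # anchor open, payload is the URL
        elif piece == "[[/A]]":
            tokens.append(("E", ""))             # anchor end
        elif piece.strip():
            tokens.append(("T", piece.strip()))  # run of plain text
    return tokens


def to_regex_string(tokens: List[Tuple[str, str]]) -> str:
    """S31: one symbol per token, so string offsets equal token indices."""
    return "".join(symbol for symbol, _ in tokens)


def extract_records(m2: str, pattern: str) -> List[Dict[str, str]]:
    """S32: match pattern M3 and map each match back to its tokens."""
    tokens = tokenize(m2)
    symbols = to_regex_string(tokens)
    records = []
    for m in re.finditer(pattern, symbols):
        matched = tokens[m.start():m.end()]
        url = next((p for s, p in matched if s == "A"), "")
        title = " ".join(p for s, p in matched if s == "T")
        records.append({"url": url, "title": title})
    return records


doc = ("[[A /news/1.html]] Budget passes [[/A]] Weather "
       "[[A /news/2.html]] Rates rise [[/A]]")
print(to_regex_string(tokenize(doc)))  # ATETATE
print(extract_records(doc, "ATE"))
```

With this encoding the developer's pattern `ATE` (anchor, text, anchor end) selects linked headlines and skips the unlinked text "Weather". The pattern describes document structure rather than literal markup, which is what would let it survive cosmetic changes to the page.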

According to the present invention, each section of an online magazine or newspaper provided as an HTML document on the web is collected in real time and passed through the noise removal and pattern matching processes. The system is therefore more robust to web site changes than conventional methods, while the developer can extract the desired article information in real time simply by indicating, in a simple and easy way, the target from which article information is to be extracted. Moreover, because the articles that the user selects from those extracted can be installed automatically on a multimedia mobile terminal such as a PDA, the user can, for example, have the newspaper information on a web site delivered automatically in the morning and stored on the PDA, and then read it easily while on the move without connecting to the web.

The present invention also provides a variety of functions, such as comparing the different viewpoints taken by newspaper articles from different online newspapers and magazines, clipping related articles, and enabling automatic client-side search for specific articles.

Accordingly, the news information collection system and collection method according to the present invention can be used effectively in push solutions, personalized news servers, information retrieval agents, and the like.

Claims

1. A news information collection system comprising:

an HTTP communication unit that collects, by HTTP communication, each section of a newspaper provided as an HTML document from an online newspaper or magazine site on the web;

a noise removing unit that removes noise present in the newspaper article document collected by the HTTP communication unit;

a pattern input unit through which a developer or user inputs a desired regular expression pattern; and

a pattern matching unit that converts the web document from which noise has been removed by the noise removing unit into a regular expression string, and matches the converted regular expression string against the regular expression pattern input by the developer or user through the pattern input unit to extract newspaper article records.

2. The system of claim 1, further comprising a PDA installation unit that presents the extracted newspaper articles to a user and automatically installs the articles selected by the user on a PDA.

3. The system of claim 1, wherein the noise removing unit comprises:

an anchor tag conversion module that converts the anchor tags enclosing each of the records that make up the list in the newspaper article document collected by the HTTP communication unit into a form that the pattern matching unit can process easily; and

a tag removal module that deletes unnecessary tags from the newspaper article document whose anchor tags have been converted.

4. The system of claim 1, wherein the pattern matching unit comprises:

a regular expression string generator that converts the web document from which noise has been removed by the noise removing unit into a regular expression string; and

a regular expression string matcher that matches the regular expression string produced by the regular expression string generator against the regular expression pattern input by the developer or user through the pattern input unit, and extracts the portion of the web document corresponding to each successfully matched substring as one record of the article list.

5. A news information collection method comprising the steps of:

collecting, by HTTP communication, each section of a newspaper provided as an HTML document from an online newspaper or magazine site on the web;

removing noise present in the collected newspaper article document;

inputting a regular expression pattern desired by a developer or user; and

converting the web document from which noise has been removed into a regular expression string, and matching the converted regular expression string against the input regular expression pattern to extract newspaper article records.

6. The method of claim 5, further comprising the step of presenting the extracted newspaper articles to a user and automatically installing the articles selected by the user on a PDA.