KR102407312B1

KR102407312B1 - Apparatus and method for crawling data using web archive

Info

Publication number: KR102407312B1
Application number: KR1020200131077A
Authority: KR
Inventors: 최진영; 김현조
Original assignee: 주식회사 어반데이터랩
Priority date: 2020-10-12
Filing date: 2020-10-12
Publication date: 2022-06-13
Also published as: KR20220048203A

Abstract

Disclosed are an apparatus, a method, and a recording medium for crawling data using a web archive, which can shorten the time needed to collect a sufficient amount of meaningful data by collecting data from various web sites with a crawling module while also collecting past data of the web sites with a web archive module. An apparatus for crawling data using a web archive according to an embodiment of the present invention includes: a crawling module configured to crawl basic list configuration information, including a title, a summary, and a creation time, from various web sites; a web archive module configured to collect past data of the web sites from a web archive site that collects and records time-based change data of the web sites; a database configured to store the basic list configuration information collected from the web sites and the past data collected from the web archive site; and a user target site creation module configured to generate, from the basic list configuration information and the past data, a user target site in which the various web sites are integrated.

Description

Apparatus and method for crawling data using web archive

The present invention relates to an apparatus and method for crawling data using a web archive, and more particularly to an apparatus and method that can shorten the time needed to collect a sufficient amount of meaningful data by collecting data from various web sites with a crawling module while also collecting past data of the web sites from a web archive site.

Recently, meta services have been drawing attention: for service providers that offer services such as shopping malls over the Internet, they add user inflow channels and allow user pools to be shared, and for consumers they make it possible to gather and view data from multiple platforms in one place. Such meta services generally collect information from web sites through crawling. Current web site crawling is implemented either by receiving values returned from calls to an application program interface (API) or by directly accessing each page of a web site and gathering the data within that page.

This conventional approach of directly accessing a web site has the problem that access is easily blocked when frequent access attempts are made within a given period. A web site may be configured by default to prevent unspecified parties from scraping its web pages, so access may be blocked by the web site or data crawling may be restricted.

In addition, in the prior art a user must access each of the sites to be compared, collect data from each, and then compare and analyze the collected data, which makes the lookup process cumbersome and inconvenient. Furthermore, trend-sensitive commerce by its nature needs to keep an up-to-date list on the web site at all times, but the excessive or frequent crawling that occurs in keeping the list current places a resource load not only on the crawling system but also on the system being crawled.

In addition, conventional crawling collects data periodically from the point at which crawling starts. With this approach, a certain amount of data must be accumulated before meaningful analysis is possible, and collecting that data through crawling inevitably takes a certain period of time or longer. Moreover, when a single machine periodically collects from a specific site, load arises on both the collecting side and the side being collected from, and when the target site is very slow, the crawler's running time may exceed the collection interval set by the scheduler, causing processes to collide.

An object of the present invention is to provide an apparatus, a method, and a recording medium for crawling data using a web archive, which can shorten the time needed to collect a sufficient amount of meaningful data by collecting data from various web sites with a crawling module while also collecting past data of the web sites from a web archive site.

Another object of the present invention is to provide an apparatus, a method, and a recording medium for crawling data using a web archive, which can keep the list of basic product information on an integrated platform up to date while reducing the excessive or frequent crawling that occurs in the process, thereby reducing the resource load on the crawling server and on the systems being crawled.

Another object of the present invention is to provide an apparatus, a method, and a recording medium for crawling data using a web archive, which crawl only the basic list configuration of products and then collect detailed product data by indirectly receiving a web site's detailed product information through the user device that requested it, thereby reducing the crawling burden on the server and avoiding crawling restrictions imposed by the web site.

Another object of the present invention is to provide an apparatus, a method, and a recording medium for crawling data using a web archive, which selectively collect detailed product information based on product preference (priority) in the shopping malls being crawled, so that collecting detailed information on only some products achieves an effect close to that of collecting detailed information on all products.

An apparatus for crawling data using a web archive according to an embodiment of the present invention includes: a crawling module configured to crawl basic list configuration information, including a title, a summary, and a creation time, from various web sites; a web archive module configured to collect past data of the web sites from a web archive site that collects and records time-based change data of the web sites; a database configured to store the basic list configuration information collected from the web sites and the past data collected from the web archive site; and a user target site creation module configured to generate, from the basic list configuration information and the past data, a user target site in which the various web sites are integrated.

The web archive module may be configured to collect the past data of the web sites from the web archive site based on the Wayback Machine.
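As a rough illustration of how a web archive module might enumerate past snapshots, the sketch below parses rows laid out like the JSON output of the Wayback Machine CDX server (a header row followed by data rows) and builds replay URLs of the form `https://web.archive.org/web/<timestamp>/<original>`. The function names, the `Snapshot` type, and the sample rows are hypothetical and are not taken from the disclosure; no network request is made.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Snapshot:
    timestamp: str  # 14-digit capture time, e.g. "20200115083000"
    original: str   # URL that was captured


def parse_cdx_rows(rows: List[List[str]]) -> List[Snapshot]:
    """Convert CDX-style rows (first row is the header) into Snapshot records."""
    if not rows:
        return []
    header, *data = rows
    ts_idx = header.index("timestamp")
    url_idx = header.index("original")
    return [Snapshot(row[ts_idx], row[url_idx]) for row in data]


def replay_url(snap: Snapshot) -> str:
    """Build the archive replay URL for one snapshot."""
    return f"https://web.archive.org/web/{snap.timestamp}/{snap.original}"


# Hypothetical sample standing in for a real CDX response.
sample_rows = [
    ["urlkey", "timestamp", "original", "mimetype", "statuscode", "digest", "length"],
    ["com,example)/shop", "20190301120000", "http://example.com/shop", "text/html", "200", "AAA", "1024"],
    ["com,example)/shop", "20200115083000", "http://example.com/shop", "text/html", "200", "BBB", "2048"],
]
snapshots = parse_cdx_rows(sample_rows)
past_urls = [replay_url(s) for s in snapshots]
```

Each replay URL could then be fetched and parsed with the same extraction rules used for the live site, which is how past data would fill the gap before periodic crawling began.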

The apparatus for crawling data using a web archive according to an embodiment of the present invention may further include an emulator module configured to, when a detailed information request is input from a user device on the user target site, collect, through the user device, detailed information related to the basic list configuration information from the target web site associated with the detailed information request, and store the detailed information in the database.

The emulator module may be configured to: cause the user device to view the detailed information on the target web site; receive, from the user device, the detailed information that the user device collected by viewing the target web site; and store the detailed information received from the user device in the database in association with the basic list configuration information.

The emulator module may be configured to: when the detailed information request is input from the user device, determine whether the detailed information is stored in the database; when the detailed information is stored in the database, provide the stored detailed information to the user device through the user target site; when the detailed information is not stored in the database, cause an application to run on the user device and access the target web site; and collect the detailed information from the target web site by means of the application running on the user device and store it in the database.
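The lookup-then-collect flow described above can be sketched as follows. This is a minimal illustration, assuming an in-memory dictionary in place of the database and a caller-supplied function standing in for the application running on the user device; `handle_detail_request` and `fake_device_fetch` are hypothetical names, not part of the disclosure.

```python
from typing import Callable, Dict, List, Optional

DetailStore = Dict[str, dict]


def handle_detail_request(
    product_id: str,
    store: DetailStore,
    fetch_via_user_device: Callable[[str], Optional[dict]],
) -> Optional[dict]:
    """Serve stored detail if present; otherwise collect it through the user device."""
    cached = store.get(product_id)
    if cached is not None:
        return cached  # already stored: no access to the target web site
    detail = fetch_via_user_device(product_id)
    if detail is not None:
        store[product_id] = detail  # keep it for later requests
    return detail


calls: List[str] = []


def fake_device_fetch(product_id: str) -> dict:
    """Stand-in for the user device viewing the target web site."""
    calls.append(product_id)
    return {"product_id": product_id, "material": "cotton"}


store: DetailStore = {}
first = handle_detail_request("p1", store, fake_device_fetch)
second = handle_detail_request("p1", store, fake_device_fetch)
```

In this sketch the second request for the same product is answered from the store, so the target web site is visited only once.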

The detailed information may include a detailed image of the product related to the basic list configuration information, as well as the product's size, weight, material, color, and additional description.

The emulator module may be configured to: determine whether the preference ranking, among the products on the target web site, of the user target product for which the detailed information request was input on the user target site satisfies a preset ranking condition; when the preference ranking of the user target product satisfies the preset ranking condition, collect the detailed information of the user target product from the target web site through the user device; and when the preference ranking of the user target product does not satisfy the preset ranking condition, refrain from collecting the detailed information of the user target product through the user device.
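The text does not say what the preset ranking condition is. Purely as an assumption for illustration, the sketch below treats it as "the product's preference rank is within the top N", with rank 1 being the most preferred; the function name and sample ranks are hypothetical.

```python
from typing import Dict


def should_collect_detail(preference_rank: int, max_rank: int) -> bool:
    """Return True when the rank (1 = most preferred) falls within the cutoff."""
    return 1 <= preference_rank <= max_rank


# Hypothetical preference ranks for products on one target web site.
ranks: Dict[str, int] = {"a": 1, "b": 2, "c": 3}

# Only products ranked in the top 2 have their detail collected.
selected = [pid for pid, rank in ranks.items() if should_collect_detail(rank, 2)]
```

Other conditions, such as a percentile cutoff or a per-category limit, would fit the same interface.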

The crawling module may include: a crawling target analyzer configured to analyze, for each of the various target web sites, the arrangement pattern of the basic list configuration information and, based on the arrangement pattern, to designate a crawling target block and the attributes and fields of the crawling target block; and a crawler configured to crawl the basic list configuration information from the various target web sites based on the crawling target block and its attributes and fields. The user target site creation module may be configured to integrate the product data for various products collected from the various target web sites into a designated object pattern and display it on the user target site.
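One way to picture integrating product data into a designated object pattern is a per-site field mapping onto a single record type. The sketch below is an assumption-laden illustration: the `ListingItem` type, the site names, and the raw field names are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class ListingItem:
    """Unified record shown on the user target site."""
    site: str
    title: str
    summary: str
    created_at: str


# Hypothetical mapping from unified field names to each site's own field names.
FIELD_MAPS: Dict[str, Dict[str, str]] = {
    "site_a": {"title": "name", "summary": "desc", "created_at": "posted"},
    "site_b": {"title": "headline", "summary": "blurb", "created_at": "date"},
}


def to_listing_item(site: str, raw: Dict[str, str]) -> ListingItem:
    """Translate one site's raw record into the unified record."""
    mapping = FIELD_MAPS[site]
    return ListingItem(
        site=site,
        title=raw[mapping["title"]],
        summary=raw[mapping["summary"]],
        created_at=raw[mapping["created_at"]],
    )


items = [
    to_listing_item("site_a", {"name": "Red mug", "desc": "Ceramic, 300 ml", "posted": "2020-10-01"}),
    to_listing_item("site_b", {"headline": "Blue mug", "blurb": "Steel, 350 ml", "date": "2020-10-02"}),
]
```

Because every item ends up with the same fields, records from different sites can be listed, sorted, and compared together.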

A method for crawling data using a web archive according to an embodiment of the present invention includes: crawling, by a crawling module, basic list configuration information, including a title, a summary, and a creation time, from various web sites; collecting, by a web archive module, past data of the web sites from a web archive site that collects and records time-based change data of the web sites; storing, in a database, the basic list configuration information collected from the web sites and the past data collected from the web archive site; and generating, by a user target site creation module, from the basic list configuration information and the past data, a user target site in which the various web sites are integrated.

The method for crawling data using a web archive according to an embodiment of the present invention may further include: when a detailed information request is input from a user device on the user target site, collecting, by an emulator module and through the user device, detailed information related to the basic list configuration information from the target web site associated with the detailed information request, and storing the detailed information in the database.

The collecting and storing of the detailed information in the database may include: causing the user device to view the detailed information on the target web site; receiving, from the user device, the detailed information that the user device collected by viewing the target web site; and storing the detailed information received from the user device in the database in association with the basic list configuration information.
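Storing detailed information in association with the basic list configuration information could, for example, be done with two tables joined by a key. The sketch below uses Python's `sqlite3` with an in-memory database; the table names, column names, and `store_detail` helper are hypothetical, and the disclosure does not specify a schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # enforce the link between the tables
conn.execute(
    "CREATE TABLE basic_list ("
    " id INTEGER PRIMARY KEY, site TEXT NOT NULL,"
    " title TEXT NOT NULL, created_at TEXT)"
)
conn.execute(
    "CREATE TABLE detail ("
    " basic_id INTEGER PRIMARY KEY REFERENCES basic_list(id),"
    " material TEXT, color TEXT)"
)


def store_detail(db: sqlite3.Connection, basic_id: int, material: str, color: str) -> None:
    """Insert or refresh the detail row tied to one basic list entry."""
    with db:  # commits on success, rolls back on error
        db.execute(
            "INSERT OR REPLACE INTO detail (basic_id, material, color) VALUES (?, ?, ?)",
            (basic_id, material, color),
        )


with conn:
    conn.execute(
        "INSERT INTO basic_list (id, site, title, created_at)"
        " VALUES (1, 'site_a', 'Red mug', '2020-10-01')"
    )
store_detail(conn, 1, "ceramic", "red")

row = conn.execute(
    "SELECT b.title, d.material, d.color"
    " FROM basic_list b JOIN detail d ON d.basic_id = b.id"
).fetchone()
```

Keying the detail row on the basic list entry means detail received from a user device is always attached to an item that was already crawled.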

The collecting and storing of the detailed information in the database may include: when the detailed information request is input from the user device, determining whether the detailed information is stored in the database; when the detailed information is stored in the database, providing the stored detailed information to the user device through the user target site; when the detailed information is not stored in the database, causing an application to run on the user device and access the target web site; and collecting the detailed information from the target web site by means of the application running on the user device and storing it in the database.

The collecting and storing of the detailed information in the database may include: determining whether the preference ranking, among the products on the target web site, of the user target product for which the detailed information request was input on the user target site satisfies a preset ranking condition; and when the preference ranking of the user target product satisfies the preset ranking condition, collecting the detailed information of the user target product from the target web site through the user device.

In the method for crawling data using a web archive according to an embodiment of the present invention, when the preference ranking of the user target product does not satisfy the preset ranking condition, the detailed information of the user target product may not be collected through the user device.

The crawling of the basic list configuration information may include: analyzing, for each of the various target web sites, the arrangement pattern of the basic list configuration information and, based on the arrangement pattern, designating a crawling target block and the attributes and fields of the crawling target block; and crawling the basic list configuration information from the various target web sites based on the crawling target block and its attributes and fields. The generating of the user target site may include integrating the product data for various products collected from the various target web sites into a designated object pattern and displaying it on the user target site.

According to an embodiment of the present invention, there is provided a computer-readable recording medium on which a program for executing the method for crawling data using a web archive is recorded.

According to an embodiment of the present invention, there are provided an apparatus, a method, and a recording medium for crawling data using a web archive, which can shorten the time needed to collect a sufficient amount of meaningful data by collecting data from various web sites with a crawling module while also collecting past data of the web sites from a web archive site.

In addition, according to an embodiment of the present invention, there are provided an apparatus, a method, and a recording medium for crawling data using a web archive, which can keep the list of basic product information on an integrated platform up to date while reducing the excessive or frequent crawling that occurs in the process, thereby reducing the resource load on the crawling server and on the systems being crawled.

In addition, according to an embodiment of the present invention, only the basic list configuration of products is crawled, and detailed product data is then collected by indirectly receiving a web site's detailed product information through the user device that requested it, which reduces the crawling burden on the server and avoids crawling restrictions imposed by the web site.

In addition, according to an embodiment of the present invention, detailed product information is collected selectively based on product preference (priority) in the shopping malls being crawled, so that collecting detailed information on only some products achieves an effect close to that of collecting detailed information on all products.

FIG. 1 is a block diagram of an apparatus for crawling data using a web archive according to an embodiment of the present invention.
FIG. 2 is a block diagram of a crawling module of the apparatus for crawling data using a web archive according to an embodiment of the present invention.
FIG. 3 is a block diagram of a crawling target analyzer of the apparatus for crawling data using a web archive according to an embodiment of the present invention.
FIG. 4 is a flowchart of a method for crawling data using a web archive according to an embodiment of the present invention.
FIG. 5 is a flowchart illustrating step S110 of FIG. 4.
FIG. 6 is a flowchart illustrating step S160 of FIG. 4.
FIG. 7 is a flowchart illustrating another example of step S160 of FIG. 4.
FIG. 8 is a flowchart illustrating yet another example of step S160 of FIG. 4.

The advantages and features of the present invention, and the ways of achieving them, will become apparent from the embodiments described in detail below together with the accompanying drawings. The present invention is not, however, limited to the embodiments disclosed below and may be implemented in various different forms. These embodiments are provided only so that the disclosure of the present invention is complete and so that a person of ordinary skill in the art to which the present invention pertains is fully informed of the scope of the invention, which is defined solely by the scope of the claims. Like reference numerals refer to like elements throughout the specification.

In this specification, when a part is said to "include" a component, this means that it may further include other components rather than excluding them, unless specifically stated otherwise. The terms 'unit' and 'module' as used herein denote a unit that processes at least one function or operation, and may refer, for example, to software, an FPGA, or a hardware component. The functions provided by a 'unit' or 'module' may be performed separately by a plurality of components or may be integrated with other additional components. A 'unit' or 'module' herein is not necessarily limited to software or hardware, and may be configured to reside in an addressable storage medium or to run on one or more processors. Embodiments of the present invention are described in detail below with reference to the drawings.

The apparatus and method for crawling data using a web archive according to an embodiment of the present invention may be provided to shorten the time needed to collect a sufficient amount of meaningful data, by collecting data from various web sites with the crawling module while also collecting past data of the web sites from a web archive site with the web archive module.

In the apparatus and method for crawling data using a web archive according to an embodiment of the present invention, the crawling server first crawls the basic product list configuration of various shopping mall sites. Later, when a user device requests detailed product information from the crawling server, the user device is made to access the target shopping mall site, and the data that the user device obtains by querying that site is transmitted to the crawling server. In this way, out of the many products, only the detailed information on products requested by user devices is crawled.

Accordingly, the apparatus and method for crawling data using a web archive according to an embodiment of the present invention can keep the list of basic information on products of online shopping malls and the like on the integrated platform up to date at all times, while reducing the system resource load by cutting down the excessive or frequent crawling that can occur when collecting detailed information.

FIG. 1 is a block diagram of an apparatus for crawling data using a web archive according to an embodiment of the present invention. The apparatus and method according to an embodiment of the present invention are described below using the collection of product data from online shopping malls as an example. Note, however, that the apparatus and method may also be applied to collect various data other than product-related data from various web sites other than online shopping malls.

Referring to FIG. 1, the apparatus 100 for crawling data using a web archive according to an embodiment of the present invention may be provided as a crawling server that integrates the basic list configuration information of products of the crawling target shopping malls 10 (for example, title, product name, creation date and time, summary, thumbnail, and product price) into a user target site, and that collects, through a user device 20, the detailed information of a product requested by a user from among those products.

The user device 20 may be the terminal of any of various users who wish to look up, compare, and analyze product information on various shopping mall web sites. It may be, for example, a smartphone, a smart pad, a tablet PC, a laptop such as a notebook computer, or a desktop computer, but is not limited thereto.

The basic list configuration information of products may be the information contained in the web page that a shopping mall site 12, 14, 16 presents on a screen showing summary information about its products. Accordingly, the basic list configuration information of products may differ for each shopping mall site 12, 14, 16.

The apparatus 100 for crawling data using a web archive according to an embodiment of the present invention may first collect, from the data of the web page that each shopping mall site 12, 14, 16 displays on a screen showing summary product information, the information corresponding to the configured basic list configuration.

The apparatus 100 for crawling data using a web archive according to an embodiment of the present invention may be provided as a server that offers product data to multiple users through the user target site, collects detailed product information through multiple user devices 20, and provides product information integrated across multiple shopping mall sites.

The apparatus 100 for crawling data using a web archive according to an embodiment of the present invention may include a crawling module 110, a web archive module 120, an emulator module 130, a user target site creation module 140, and a database 150.

The crawling module 110 may be configured to preferentially crawl, from the various shopping mall sites 12, 14, 16 corresponding to the crawling target shopping malls 10, only the basic list configuration information (for example, title, product name, summary, thumbnail, product price, and creation date and time), out of the basic list configuration information and the detailed product information (for example, the detailed product image, size, weight, material, color, and additional description of the user target product).

The crawling target shopping mall 10 may be a shopping mall website that holds original data including the basic list configuration information and the detailed information of products. It may be a shopping mall website designated in the server or designated by an administrator terminal (not shown), but is not limited thereto.

FIG. 2 is a block diagram of the crawling module of the data crawling apparatus using a web archive according to an embodiment of the present invention. Referring to FIGS. 1 and 2, the crawling module 110 may include a crawling target analyzer 112 and a crawler 114.

In an embodiment, the crawling module 110 may crawl the basic list configuration information by analyzing, for each shopping mall site, the pattern of the document object model and the crawling target block that the shopping mall site provides to the user device 20. As an example, the crawling module 110 may be implemented as one or more virtual machines.

The crawling target analyzer 112 may be configured to analyze the arrangement pattern of the basic list configuration information for each of the shopping mall sites 12, 14, 16 and, based on that arrangement pattern, to designate the crawling target block and the attributes and fields of the crawling target block.

The crawling module 110 may recognize the document object model pattern of each shopping mall site 12, 14, 16 and, based on the pattern recognized for each site, extract the required information from the web pages of that site.

The document object model of a shopping mall site may be an interface that allows a program or script to access the elements of a web page and change their content, style, and so on. When a web page such as an HTML or XML document is loaded in a browser, the document object model organizes the elements of the document as a tree within the browser.

The crawler 114 may be configured to crawl the basic list configuration information from the target shopping mall sites 12, 14, 16 based on the crawling target block and the attributes and fields of the crawling target block.
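As an illustration only, the block-and-field driven extraction described above could be sketched as follows. The specification does not disclose a concrete data format, so the `SPEC` dictionary, the CSS class names, and the `ListingParser` and `extract_listing` names are all assumptions made for this sketch. It uses Python's standard `html.parser` module and extracts text values only.

```python
from html.parser import HTMLParser

# Hypothetical per-site specification: the crawling target block is identified
# by a CSS class, and each field maps to the class of the element holding it.
SPEC = {
    "block_class": "item",
    "fields": {"name": "title", "price": "price"},
}
# Tags that never have a closing tag, so they must not change nesting depth.
VOID_TAGS = {"img", "br", "hr", "input", "meta", "link"}


class ListingParser(HTMLParser):
    def __init__(self, spec):
        super().__init__()
        self.block_class = spec["block_class"]
        self.class_to_field = {cls: name for name, cls in spec["fields"].items()}
        self.records = []
        self._record = None  # record for the block currently being read
        self._field = None   # field whose text is currently being read
        self._depth = 0      # nesting depth inside the current block

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self._record is None:
            # Outside any block: only look for the start of a target block.
            if self.block_class in classes:
                self._record, self._depth = {}, 1
            return
        if tag not in VOID_TAGS:
            self._depth += 1
        for cls in classes:
            if cls in self.class_to_field:
                self._field = self.class_to_field[cls]

    def handle_endtag(self, tag):
        if self._record is None or tag in VOID_TAGS:
            return
        self._field = None
        self._depth -= 1
        if self._depth == 0:  # the block element itself has closed
            self.records.append(self._record)
            self._record = None

    def handle_data(self, data):
        if self._record is not None and self._field and data.strip():
            self._record[self._field] = data.strip()


def extract_listing(html, spec=SPEC):
    """Return one dict per crawling target block found in the page."""
    parser = ListingParser(spec)
    parser.feed(html)
    return parser.records
```

Keeping the per-site differences in a small specification object, rather than in code, mirrors the idea that an administrator designates blocks, attributes, and fields for each site while the crawler itself stays generic.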

The crawler 114 may access each website at a preset interval and perform crawling. The data crawled by the crawler 114 (the basic list configuration information) may be stored in the database 150.

In addition to the crawling module 110, the basic list configuration information of products may also be collected through an administrator terminal (not shown). The administrator terminal may designate mining blocks, attributes, and fields for each of the shopping mall sites 12, 14, 16, and may perform URL list management, error log management, authentication management, and the like.

In this case, the administrator may use the administrator terminal to designate mining blocks, attributes, and fields, and to manage URL lists, error logs, and authentication, based on the structure of the web pages collected from the shopping mall sites 12, 14, 16.

Using the administrator terminal, the administrator may manage the list of shopping mall websites, the mining areas, the collection interval, the error log, and the like through a graphical user interface (GUI). These administrator terminal functions may also be integrated into the server that performs the crawling.

FIG. 3 is a block diagram of the crawling target analyzer of the data crawling apparatus using a web archive according to an embodiment of the present invention. Referring to FIGS. 1 to 3, the crawling target analyzer 112 may analyze the crawling targets for web crawling on a shopping mall site.

The crawling target analyzer 112 may include a mining block designator 1122, an attribute designator 1124, a field designator 1126, a URL list manager 1128, an error log manager 1130, an authentication manager 1132, a pattern recognizer 1134, a scraping unit 1136, a scheduler 1138, an error handler 1140, an API interworking module 1142, and a data storage unit 1144.

The mining block designator 1122 may include a mining block designation user interface (UI) for designating mining blocks in the data collected from the shopping mall sites 12, 14, 16.

The mining block designator 1122 may set or designate the position, size, and so on of the various blocks within the web pages of the shopping mall sites 12, 14, 16 that present information such as a product's name, thumbnail, and price.

The attribute designator 1124 may include an attribute designation UI for designating attributes in the data collected from the shopping mall sites 12, 14, 16. The attribute designator 1124 may set or designate the attributes of the mining blocks designated by the mining block designator 1122.

For example, the attribute designator 1124 may set or designate, for each of the shopping mall sites 12, 14, 16, the attributes relating to basic list configuration information such as a product's name, thumbnail, and price.

The field designator 1126 may include a field designation UI for designating fields in the data collected from a website. The field designator 1126 may designate the fields of the mining blocks whose attributes have been designated by the attribute designator 1124.

For example, the field designator 1126 may set or designate, for each shopping mall website, a separate field for each item of basic list configuration information, such as the product name, thumbnail, and price.

The URL list manager 1128 may manage the list of web page URLs (Uniform Resource Locators) for each shopping mall website. The error log manager 1130 may manage an error log that records the errors occurring while data is crawled from a website.

The authentication manager 1132 may manage the authentication information used for website access and for data designation and management. The authentication information managed by the authentication manager 1132 may include the authentication information of the administrator terminal, the authentication information of the server performing the crawling, and the like.

The pattern recognizer 1134 may recognize the document object model pattern of each website, based on the mining blocks, attributes, and fields designated for that website.

The scraping unit 1136 may extract the required information based on the document object model pattern identified for each website. The scraping unit 1136 may include a scraping module that scrapes the web pages of a website based on the document object model pattern, and a parser module that parses the web page data.

The scheduler 1138 may manage the threads used to crawl website data. The error handler 1140 may handle the errors that occur while website data is being crawled.

The API interworking module 1142 may interwork with external APIs in order to crawl website data. When a website provides an application programming interface (API), the server may query data through that API by means of the API interworking module 1142.

The data storage unit 1144 may store in the database 150 the data crawled by the crawling module 110 (the basic product list configuration information) and the data collected by the emulator module 130 (the detailed product information).

Referring again to FIG. 1, the web archive module 120 may collect past data of the various websites from a web archive site that collects and records the changes made to those websites over time.

The web archive module 120 may be configured to collect, by means of the Wayback Machine, the past data of websites such as the various shopping mall sites from the web archive site.

The web archive site may be a site that provides an Internet archiving service and that collects and records how a website's appearance has changed by year, month, and day, down to the level of its source code.

The web archive module 120 may search the data in reverse chronological order, starting from the present, and collect the past data corresponding to the points in time at which the website's information was updated, added, or changed.

The past data of a website may be collected by crawling, in reverse order from the present, the past web page information that was updated, added, or changed. To speed up the crawling of past data, only the past data relating to the updated, added, or changed parts of the web pages may be crawled.
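The reverse-chronological walk can be sketched as below. This is a minimal illustration that operates on an in-memory list of snapshots rather than on any real archive service. The `(timestamp, html)` pair format and the function name `collect_changed_snapshots` are assumptions for this sketch. Change detection here compares whole-page hashes, which is a simplification of "crawl only the changed parts".

```python
import hashlib


def collect_changed_snapshots(snapshots):
    """Walk snapshots from newest to oldest and keep one per distinct version.

    `snapshots` is an iterable of (timestamp, html) pairs in any order, where
    timestamps sort chronologically (e.g. "20200301"). A snapshot is kept only
    when its content differs from the next-newer snapshot already visited, so
    the newest capture of each run of identical pages is the one retained.
    """
    kept = []
    last_digest = None
    for timestamp, html in sorted(snapshots, key=lambda s: s[0], reverse=True):
        digest = hashlib.sha256(html.encode("utf-8")).hexdigest()
        if digest != last_digest:  # content changed between these captures
            kept.append((timestamp, html))
            last_digest = digest
    return kept
```

Skipping captures whose content is identical to a newer one avoids re-parsing pages that would yield no new records, which is one way to realize the speed-up described above.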

In an embodiment, the web archive module 120 may select which of the various websites should have their archived data collected first, and may process the websites sequentially in that order of priority.

For example, the web archive module 120 may prioritize the websites in ascending order of the amount of basic list configuration information collected so far, so that the sites with the least data are processed first. This shortens the time required to collect meaningful data for the various websites.
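A minimal sketch of this ordering follows, assuming the amount of collected data is tracked as a record count per site. The `archive_order` name and the site identifiers are hypothetical.

```python
def archive_order(record_counts):
    """Return site identifiers ordered so that the sites with the fewest
    collected records come first; ties are broken by site identifier."""
    return sorted(record_counts, key=lambda site: (record_counts[site], site))


# Example: the site with only 20 records is archived first.
counts = {"site12": 500, "site14": 20, "site16": 130}
order = archive_order(counts)
```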

According to the embodiment described above, data is collected not only by crawling from the present onward but also from the past states of the target sites recorded in the Wayback Machine, an Internet archiving service, which shortens the time needed to accumulate meaningful data.

The database 150 may be configured to store the product data collected by crawling the basic list configuration information of the various products from the shopping mall sites 12, 14, 16, as well as the past data of the websites collected from the web archive site by the web archive module 120.

The user target site creation module 140 may generate a user target site that integrates the shopping mall sites 12, 14, 16, using the product data (including the basic list configuration information) stored in the database 150 and the past data of the shopping mall sites collected through the web archive.

The user target site creation module 140 may be configured to receive, from the user device 20, a request for detailed information on at least one user target product on the user target site.

The user target site creation module 140 may be configured to consolidate the product data collected from the target shopping mall sites 12, 14, 16 into a designated object pattern and display it on the user target site.

The user target site creation module 140 may integrate the data collected from the shopping mall sites 12, 14, 16 and provide it to the user device 20 as a single service. Users can thus grasp the information of multiple shopping mall websites more quickly and efficiently, and compare and analyze it across sites.

In an embodiment, the user target site creation module 140 may present information by unifying related fields (identical or similar fields) from the various websites under a single field name. The related fields may be grouped based on the word similarity between the field names of the websites, a thesaurus, and the like.
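One way to picture this field unification is the sketch below. It assumes a small hand-written synonym table standing in for the thesaurus and uses `difflib.SequenceMatcher` from the Python standard library as the word-similarity measure. The canonical names, the synonym entries, and the 0.8 threshold are illustrative assumptions and do not come from the specification.

```python
from difflib import SequenceMatcher

# Hypothetical canonical field names used on the integrated site.
CANONICAL = ["name", "price", "thumbnail"]
# Hypothetical thesaurus mapping site-specific names to canonical ones.
SYNONYMS = {"cost": "price", "product_name": "name"}


def unify_field(field, threshold=0.8):
    """Map a site-specific field name to a canonical field name.

    Tries an exact match, then the thesaurus, then string similarity.
    Returns the normalized input unchanged if nothing is similar enough.
    """
    key = field.strip().lower()
    if key in CANONICAL:
        return key
    if key in SYNONYMS:
        return SYNONYMS[key]

    def similarity(candidate):
        return SequenceMatcher(None, key, candidate).ratio()

    best = max(CANONICAL, key=similarity)
    return best if similarity(best) >= threshold else key
```

Checking the thesaurus before falling back to string similarity lets unrelated spellings of the same concept (such as "cost" and "price") be merged, while the threshold keeps genuinely different fields apart.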

The data crawling apparatus using a web archive according to an embodiment of the present invention can return the data of multiple websites to the user at once. Users can therefore look up and compare data from different websites in one place, without moving between sites, and can find the data they want quickly and conveniently.

In addition, when the crawling server is configured to let users upload their own data, the data uploaded directly to the crawling server can be shown together with the website data.

When the detailed information request for a user target product listed on the user target site is received from the user device 20, the emulator module 130 may be configured to collect, through the user device 20, the detailed information of the user target product from the corresponding target shopping mall site and to store it in the database 150.

In an embodiment, the emulator module 130 may cause the user device 20 to query and view the detailed information on the target shopping mall site from which the product data of the user target product was collected.

The emulator module 130 may crawl the detailed product information of the target shopping mall site by running an application program on the user device 20 that calls and queries the detailed product information on the target shopping mall site, and then collecting the detailed product information that the user device 20 receives from that site.

In addition, the emulator module 130 may receive from the user device 20 the detailed information of the user target product that the user device 20 viewed on the target shopping mall site.

The emulator module 130 may be configured to add the detailed product information received from the user device 20 to the product data of the user target product and store it in the database 150.

When a detailed information request for a user target product is received from the user device, the emulator module 130 may determine whether the detailed information of that product is already stored in the database 150.

The emulator module 130 may access a detail page classified as a search target, read its data, and pass the analyzed content to the server by calling a server API. Besides user devices such as mobile phones, the emulator module 130 can be deployed in various forms, including virtual machines (VMs) and physical devices that run their own operating system, such as desktop PCs, workstations, and Raspberry Pi boards.

If the detailed information of the user target product is stored in the database 150, the emulator module 130 may provide the stored detailed information to the user device 20 through the user target site.

If the detailed information of the user target product is not stored in the database 150, the emulator module 130 may cause the user device 20 to run an application program that accesses the target shopping mall site.

The emulator module 130 may then be configured to collect the detailed information from the target shopping mall site by means of the application program running on the user device 20 and to store it in the database 150.
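The lookup-then-collect flow of the preceding paragraphs can be sketched as follows. This is an illustration under stated assumptions: the database is modeled as a plain dictionary, and `fetch_via_device` is a hypothetical callable standing in for whatever mechanism makes the user device retrieve the detail page.

```python
def get_detail(product_id, db, fetch_via_device):
    """Return detailed information for a product.

    If the detail is already stored in `db` it is served from there;
    otherwise it is collected through the user device and stored so that
    later requests do not trigger another access to the shopping mall site.
    """
    detail = db.get(product_id)
    if detail is None:
        detail = fetch_via_device(product_id)  # hypothetical device-side fetch
        db[product_id] = detail
    return detail
```

With this arrangement each product's detail page is fetched from the target site at most once, and only for products that a user has actually requested.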

The emulator module 130 may collect detailed product information through distributed parallel processing across multiple virtual machines, which increases the crawling throughput. In addition, running a lightweight mobile OS, an environment similar to that of a user device, in parallel as virtual environments makes it possible to achieve high speed at reduced cost.

Furthermore, because the emulator module 130 issues distributed calls in parallel, it crawls quickly. Because the requests keep the form of ordinary API calls, the load placed on the crawled service is kept to a level not much different from that of ordinary users viewing detail pages.

Bottlenecks that may arise during collection, for example from network problems or slow sites, can be mitigated through distributed processing. That is, the work to be processed is divided into small units and distributed across multiple machines or processes, where it is handled either simultaneously or according to each unit's schedule. This reduces the load of each task and increases processing speed.
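The idea of splitting work into small units and processing them concurrently can be sketched on a single machine with Python's `concurrent.futures` module. The specification describes distribution across multiple machines or processes, so the thread pool here is only a stand-in. The `fetch` callable, the chunk size, and the worker count are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor


def chunked(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def crawl_in_parallel(urls, fetch, chunk_size=2, workers=4):
    """Apply `fetch` to every URL, processing the chunks concurrently.

    Results are returned in the same order as `urls`.
    """
    chunks = chunked(list(urls), chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each worker handles one small unit of work (one chunk) at a time.
        per_chunk = list(pool.map(lambda chunk: [fetch(u) for u in chunk], chunks))
    return [result for chunk in per_chunk for result in chunk]
```

Because each unit is small, one slow site or network stall delays only the chunk it belongs to rather than the whole collection run.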

In an embodiment, the emulator module 130 may determine the preference rank (priority) of the user target product for which the detailed information request was received, among the products on the target shopping mall sites 12, 14, 16, and may determine whether that preference rank satisfies a preset rank condition.

If the preference rank of the user target product is determined to satisfy the preset rank condition, the emulator module 130 may collect the detailed information of the user target product from the target shopping mall sites 12, 14, 16 through the user device 20.

If the preference rank of the user target product does not satisfy the preset rank condition, the emulator module 130 may be configured not to collect the detailed information of the user target product through the user device 20.
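As a hedged illustration, the rank condition could be as simple as a cutoff on the preference rank. The cutoff value of 100, the function name, and the treatment of unranked products are assumptions for this sketch, since the specification leaves the concrete condition open.

```python
def should_collect_detail(preference_rank, max_rank=100):
    """Return True when the product's preference rank (1 = most preferred)
    satisfies the preset condition, here modeled as rank <= max_rank.
    Products with no known rank are not collected."""
    return preference_rank is not None and preference_rank <= max_rank
```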

The preference rank of a user target product may be calculated based on, for example, the product popularity ranking of each shopping mall site, the order in which products are displayed, or a product popularity ranking calculated or predicted by the crawling server, but is not limited thereto.

With the data crawling apparatus using a web archive according to the embodiment described above, the crawling module collects data from the various websites while the web archive module also collects the past data of those websites, which shortens the time required to gather a sufficient amount of meaningful data.

Meanwhile, when a website is equipped with a function that prevents unspecified parties from scraping its web pages, repeated attempts by an administrator terminal or server to access the website for data collection can easily lead to access being blocked, making it impossible to crawl the website's data.

According to the embodiment of the present invention, however, the server has the user device request the data from the website and thereby collects the website's data indirectly, so the server's crawling can be kept from being restricted by the website.

That is, when collecting information about crawling targets, the embodiment crawls only the information needed for the basic list configuration, rather than detailed information that users may never view, and thereby keeps the listing up to date. It then analyzes the characteristics of frequently viewed products and collects detailed information mainly for the products judged to have high priority.

Accordingly, the data crawling apparatus using a web archive according to the embodiment uses users' product access data to collect detailed product information efficiently on the basis of product preference (priority), so that collecting only part of the data achieves an effect close to collecting all of it.

In addition, the embodiment maintains the basic list configuration for a wide range of products, giving users a broad selection, while reducing the server's crawling burden and preventing the shopping mall sites from blocking web page access because of frequent requests from the server.

Accordingly, when a user accesses the detailed information of a product, the additional information from the target service site is collected through the user device, delivered to the crawling system, and reused for subsequent accesses to that product. Using user device resources in this way reduces the load on the central system and also reduces the load on the target service system.

Furthermore, because detailed product information is collected in response to requests from the user device 20, the history of detailed product information requests made by the user device 20 can be collected in the process.

It is therefore also possible to analyze a user's product selection tendencies from the detailed product information request history of the user device 20 and feed the result back to the user target site, so that product information suited to the user is recommended first.

FIG. 4 is a flowchart of a data crawling method using a web archive according to an embodiment of the present invention. The method is described below with reference to FIGS. 1 and 4.

The data crawling method using a web archive according to an embodiment of the present invention may integrate the basic list configuration information of the products of the crawling target shopping mall 10 and the past data of the crawling target shopping mall 10 into a user target site. In addition, the detailed information of a product requested by a user may be collected through the user device 20.

In addition, the method may provide product data to a plurality of users through the user target site, based on the basic list information and the past data collected from the crawling target shopping mall 10, and may additionally collect detailed product information through a plurality of user devices 20.

The crawling module 110 may crawl the various shopping mall sites 12, 14, 16 corresponding to the crawling target shopping mall 10. Of the basic list configuration information (for example, title, product name, summary, thumbnail, product price, and creation date and time) and the detailed product information (for example, detailed product images, size, weight, material, color, and additional description of a user target product), it crawls only the basic list configuration information (S110).

In doing so, the crawling module 110 may crawl the basic list configuration information by analyzing, for each shopping mall site, the pattern of the document object model and the crawling target block that the shopping mall site provides to the user device 20.

FIG. 5 is a flowchart showing step S110 of FIG. 4 in detail. Referring to FIGS. 1, 4, and 5, the crawling module 110 may analyze the arrangement pattern of the basic list configuration information for each of the shopping mall sites 12, 14, 16 (S112) and, based on the arrangement pattern, designate the crawling target block and the attributes and fields of the crawling target block (S114).

The crawling module 110 may recognize the pattern of the document object model for each of the various shopping mall sites 12, 14, and 16 and, based on the pattern recognized for each site, extract the required information from the web pages of the shopping mall sites 12, 14, and 16. The crawling module 110 may crawl the basic list configuration information based on the crawling target block and the attributes and fields of the crawling target block (S116).
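Steps S112 to S116 can be illustrated with a minimal sketch. The patent does not disclose an implementation, so everything here is an assumption made for illustration: the per-site pattern is reduced to one CSS class that marks the crawling target block plus a mapping from CSS classes to field names, and the names `ListingParser`, `crawl_listing`, and the sample markup are hypothetical. Only the Python standard library is used.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not change the nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class ListingParser(HTMLParser):
    """Collects one dict per crawling target block, using a per-site pattern."""

    def __init__(self, block_class, field_classes):
        super().__init__()
        self.block_class = block_class      # class marking the target block (assumed)
        self.field_classes = field_classes  # CSS class -> output field name (assumed)
        self.items = []
        self._depth = 0       # nesting depth inside the current block, 0 = outside
        self._field = None    # field whose text is expected next, if any
        self._current = None  # record being built for the current block

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self._depth == 0:
            if self.block_class in classes:
                self._depth = 1
                self._current = {}
            return
        if tag not in VOID_TAGS:
            self._depth += 1
        for cls in classes:
            if cls in self.field_classes:
                self._field = self.field_classes[cls]

    def handle_data(self, data):
        if self._depth and self._field and data.strip():
            self._current[self._field] = data.strip()
            self._field = None

    def handle_endtag(self, tag):
        if self._depth == 0 or tag in VOID_TAGS:
            return
        self._depth -= 1
        if self._depth == 0:  # the target block just closed
            self.items.append(self._current)
            self._current = None


def crawl_listing(html, block_class, field_classes):
    """Extract the basic list fields from every target block in the page."""
    parser = ListingParser(block_class, field_classes)
    parser.feed(html)
    return parser.items


SAMPLE_HTML = """
<ul>
  <li class="item"><a class="name">Mug</a><span class="cost">12000</span></li>
  <li class="item"><a class="name">Lamp</a><span class="cost">35000</span></li>
</ul>
"""

if __name__ == "__main__":
    print(crawl_listing(SAMPLE_HTML, "item", {"name": "title", "cost": "price"}))
```

Keeping the pattern as data (a block class and a field map) means a new shopping mall site would need only a new pattern entry, not new parsing code. This sketch does not handle blocks nested inside blocks of the same class.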

Referring again to FIGS. 1 and 4, the product data on various products, collected by crawling the basic list configuration information for each product from the various shopping mall sites 12, 14, and 16, may be stored in the database 150 (S120).

The web archive module 120 may collect past data of the various web sites from a web archive site that collects and records the changes to those web sites over time (S130).

The web archive module 120 may search the data sequentially in reverse chronological order, starting from the current time, and collect the past data corresponding to the points in time at which the information on a web site was updated, added, or changed.
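The reverse-chronological search can be sketched as follows. This is an illustrative assumption, not the disclosed implementation: each archived capture is modeled as a hypothetical `Snapshot` with a sortable timestamp string and a content digest, and a "change point" is taken to be the earliest capture of each distinct state of the page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    timestamp: str  # sortable capture time, e.g. "20200930120000" (assumed format)
    digest: str     # hash of the captured content (assumed to be available)


def changed_snapshots(snapshots):
    """Walk from newest to oldest and keep the first capture of each state."""
    kept = []
    candidate = None  # oldest capture seen so far with the current digest
    for snap in sorted(snapshots, key=lambda s: s.timestamp, reverse=True):
        if candidate is not None and snap.digest != candidate.digest:
            kept.append(candidate)  # content differs: candidate was a change point
        candidate = snap
    if candidate is not None:
        kept.append(candidate)      # the oldest capture is always a change point
    return kept


if __name__ == "__main__":
    history = [
        Snapshot("20200101000000", "A"),
        Snapshot("20200201000000", "A"),
        Snapshot("20200301000000", "B"),
        Snapshot("20200401000000", "B"),
        Snapshot("20200501000000", "C"),
    ]
    for snap in changed_snapshots(history):
        print(snap.timestamp, snap.digest)
```

Captures whose content is identical to an older neighbor are skipped, so only the points where the site actually changed are fetched and stored.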

In an embodiment, the web archive module 120 may select, from among the various web sites, the web sites on which web archiving is to be performed first, and may web archive the web sites sequentially according to their priority.

For example, the web archive module 120 may perform web archiving by prioritizing the web sites in ascending order of the amount of basic list configuration information collected for each, so that the sites with the least collected data are archived first. This can shorten the time required to collect meaningful data for the various web sites.
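The prioritization rule amounts to an ascending sort on the amount of data already collected per site. The sketch below assumes that this amount is available as a simple record count; the function name `archive_order` and the site names are hypothetical.

```python
def archive_order(collected_counts):
    """Return site names ordered so the site with the least data comes first.

    collected_counts maps a site name to the number of basic list records
    already crawled from it (an assumed measure of "data collection amount").
    Ties are broken by site name so the order is deterministic.
    """
    return sorted(collected_counts, key=lambda site: (collected_counts[site], site))


if __name__ == "__main__":
    counts = {"mall-a": 500, "mall-b": 20, "mall-c": 120}
    print(archive_order(counts))
```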

The product data on various products, collected by crawling the basic list configuration information for each product from the various shopping mall sites 12, 14, and 16, and the past data collected from the web archive site by the web archive module 120 may be stored in the database 150.

The user target site creation module 140 may generate a user target site in which the various shopping mall sites 12, 14, and 16 are integrated, from the product data including the basic list configuration information stored in the database 150 and the past data of the shopping mall sites collected using the web archive (S140).

The user target site creation module 140 may provide a service that integrates the data collected from the various shopping mall sites 12, 14, and 16 and presents it collectively to the user device 20. Accordingly, the user can grasp the information of the various shopping mall web sites more quickly and efficiently and compare and analyze them against one another.

According to the data crawling method using a web archive according to an embodiment of the present invention, data from various web sites can be returned to the user at once. Users can look up the data in one place without moving between sites, compare data from multiple web sites, and find the data they want quickly and conveniently.

In addition, when the system is configured so that users can upload their own data to the server, the data uploaded to the server can be shown together with the data from the web sites.

The user target site creation module 140 may receive, from the user device 20, a request for detailed information on at least one user target product on the user target site (S150).

When a request for detailed information on at least one user target product on the user target site is input from the user device 20, the emulator module 130 may collect the detailed information of the user target product from the target shopping mall site through the user device 20 and store it in the database 150 (S160).

FIG. 6 is a flowchart illustrating step S160 of FIG. 4. Referring to FIGS. 1, 4, and 6, the emulator module 130 may cause the user device 20 to view the detailed information on the target shopping mall site from which the product data corresponding to the user target product was collected (S162).

In addition, the emulator module 130 may receive, from the user device 20, the detailed information of the user target product that the user device 20 viewed on the target shopping mall site (S164).

The emulator module 130 may include the detailed information received from the user device 20 in the product data for the user target product and store it in the database 150 (S166).

FIG. 7 is a flowchart illustrating another example of step S160 of FIG. 4. Referring to FIGS. 1, 4, and 7, when a request for detailed information on a user target product is input from the user device, the emulator module 130 may determine whether the detailed information of the user target product is stored in the database 150 (S172). In this case, the server may use URL information to determine whether the information requested by the user device is stored on the server.

When the detailed information of the user target product is stored in the database 150, the emulator module 130 may provide the detailed information stored in the database 150 to the user device 20 through the user target site (S174, S176).

When the detailed information of the user target product is not stored in the database 150, the emulator module 130 may cause an application program to be executed on the user device 20 to access the target shopping mall site (S174, S178).

Then, the emulator module 130 may collect the detailed information from the target shopping mall site by means of the application program running on the user device 20 and store it in the database 150 (S170).
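The flow of FIG. 7 resembles a cache lookup keyed by URL. The sketch below makes several assumptions that are not in the patent: the database 150 is modeled as a plain dictionary, the user-device collection is modeled as a callable passed in by the caller, and the names `get_detail` and `fake_device_fetch` as well as the example URL are hypothetical.

```python
def get_detail(url, database, fetch_via_device):
    """Return (detail, source) for a product URL, filling the store on a miss."""
    if url in database:                   # S172, S174: is the detail already stored?
        return database[url], "database"  # S176: serve the stored copy
    detail = fetch_via_device(url)        # S178: the user device accesses the site
    database[url] = detail                # keep the result for later requests
    return detail, "device"


# Stand-in for the application program running on the user device (assumption).
calls = []


def fake_device_fetch(url):
    calls.append(url)
    return {"url": url, "material": "ceramic"}


if __name__ == "__main__":
    db = {}
    first = get_detail("https://shop.example/p/1", db, fake_device_fetch)
    second = get_detail("https://shop.example/p/1", db, fake_device_fetch)
    print(first[1], second[1], len(calls))
```

Passing the fetch function in as a parameter keeps the server-side logic independent of how the user device actually retrieves the page.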

FIG. 8 is a flowchart illustrating yet another example of step S160 of FIG. 4. Referring to FIGS. 1, 4, and 8, the emulator module 130 may determine the preference rank (priority) of the user target product for which the detailed information request was input, among the products included in the target shopping mall sites 12, 14, and 16 (S182), and may determine whether the preference rank of the user target product satisfies a preset ranking condition (S184).

When the preference rank of the user target product is determined to satisfy the preset ranking condition, the emulator module 130 may collect the detailed information of the user target product from the target shopping mall sites 12, 14, and 16 through the user device 20 (S184, S186).

When the preference rank of the user target product does not satisfy the preset ranking condition, the emulator module 130 may be configured not to collect the detailed information of the user target product through the user device 20 (S184, S188).
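The decision in steps S182 to S188 can be sketched as a rank check. The patent leaves both the preference measure and the ranking condition open, so this example assumes that preference is measured by view counts and that the condition is "within the top N"; the names `preference_rank`, `should_collect`, and `top_n` are hypothetical.

```python
def preference_rank(product_id, view_counts):
    """Return the 1-based rank of a product, most viewed first (S182)."""
    ordered = sorted(view_counts, key=lambda pid: (-view_counts[pid], pid))
    return ordered.index(product_id) + 1


def should_collect(product_id, view_counts, top_n):
    """Return True when the product meets the assumed top-N condition (S184)."""
    if product_id not in view_counts:
        return False  # no access data yet, so the condition cannot be met
    return preference_rank(product_id, view_counts) <= top_n


if __name__ == "__main__":
    views = {"p1": 40, "p2": 5, "p3": 90}
    for pid in views:
        action = "collect (S186)" if should_collect(pid, views, top_n=2) else "skip (S188)"
        print(pid, action)
```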

According to the data crawling method using a web archive according to the embodiment of the present invention described above, data is collected from various web sites by the crawling module while past data of the web sites is also collected by the web archive module, so that the time needed to collect a sufficient amount of meaningful data can be shortened.

In addition, according to an embodiment of the present invention, the server is configured to have the user device request data from a web site, so that the data of the web site is collected and crawled indirectly. This can prevent the web site from restricting crawling by the server.

When collecting information about a crawling target, the data crawling method according to an embodiment of the present invention crawls only the information needed to build the basic list, rather than detailed information that users may never look up, and thereby keeps the list up to date. It then analyzes the characteristics of frequently viewed products and collects detailed information mainly for the products judged to have high priority.

Therefore, according to the data crawling method using a web archive according to an embodiment of the present invention, detailed product information is collected efficiently on the basis of product preference (priority) derived from user product access data, so that collecting only part of the data can achieve an effect close to that of collecting all of it.

Therefore, according to an embodiment of the present invention, when a user accesses the detailed information of a product, additional information from the target service site is collected through the user device, delivered to the crawling system, and used when the product is accessed. Because the resources of the user device are used, the load on the central system is reduced, and the load on the target service system can also be reduced. A further advantage is that errors such as data collection failures and data collection delays do not affect the crawler as a whole.

The embodiments described above may be implemented as hardware components, software components, and/or a combination of hardware and software components. For example, the devices, methods, and components described in the embodiments may be implemented using one or more general-purpose or special-purpose computers, such as a processor, a controller, an arithmetic logic unit (ALU), a digital signal processor, a microcomputer, a field programmable gate array (FPGA), a programmable logic unit (PLU), a microprocessor, or any other device capable of executing and responding to instructions.

The processing device may run an operating system and one or more software applications that run on the operating system. The processing device may also access, store, manipulate, process, and generate data in response to execution of the software. For ease of understanding, the description sometimes refers to a single processing device, but those of ordinary skill in the art will understand that the processing device may include a plurality of processing elements and/or a plurality of types of processing elements.

For example, the processing device may include a plurality of processors, or one processor and one controller. Other processing configurations, such as parallel processors, are also possible. The software may include a computer program, code, instructions, or a combination of one or more of these, and may configure the processing device to operate as desired or may instruct the processing device independently or collectively.

The software and/or data may be embodied, permanently or temporarily, in any type of machine, component, physical device, virtual equipment, computer storage medium or device, or transmitted signal wave, in order to be interpreted by the processing device or to provide instructions or data to the processing device. The software may be distributed over networked computer systems and stored or executed in a distributed manner. The software and data may be stored on one or more computer-readable recording media.

The method according to the embodiment may be implemented in the form of program instructions that can be executed by various computer means and recorded on a computer-readable medium. The computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the medium may be specially designed and configured for the embodiment, or may be known and available to those skilled in computer software.

Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and DVDs; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include not only machine code such as that produced by a compiler but also high-level language code that can be executed by a computer using an interpreter or the like. The hardware devices described above may be configured to operate as one or more software modules in order to perform the operations of the embodiments, and vice versa.

Although the embodiments have been described above with reference to limited embodiments and drawings, those of ordinary skill in the art can make various modifications and variations from the above description. For example, appropriate results may be achieved even if the described techniques are performed in an order different from the described method, and/or the described components of the system, structure, device, circuit, and the like are coupled or combined in a form different from the described method, or are replaced or substituted by other components or equivalents. Therefore, other implementations, other embodiments, and equivalents of the claims also fall within the scope of the claims that follow.

10: crawling target shopping mall
12, 14, 16: shopping mall site
20: user device
100: data crawling apparatus using a web archive
110: crawling module
112: crawling target analyzer
114: crawler
120: web archive module
130: emulator module
140: user target site creation module
150: database

Claims

1. A data crawling apparatus using a web archive, comprising:
a crawling module configured to crawl basic list configuration information, including a title, a summary, and a creation time, from various web sites;
a web archive module configured to collect past data of the web sites from a web archive site that collects and records changes to the web sites over time;
a database configured to store the basic list configuration information collected from the web sites and the past data collected from the web archive site; and
a user target site creation module configured to generate, from the basic list configuration information collected from the various web sites and the past data collected from the web archive site in relation to the various web sites, a user target site in which the basic list configuration information and the past data are integrated.

2. The data crawling apparatus of claim 1,
wherein the web archive module is configured to collect the past data of the web sites from the web archive site based on the Wayback Machine.

3. The data crawling apparatus of claim 1,
further comprising an emulator module configured to, when a detailed information request is input from a user device at the user target site, collect detailed information related to the basic list configuration information from a target web site related to the detailed information request through the user device, and store the detailed information in the database.

4. The data crawling apparatus of claim 3,
wherein the emulator module is configured to:
cause the user device to view the detailed information on the target web site;
receive, from the user device, the detailed information collected by the user device viewing the target web site; and
store the detailed information received from the user device in the database in association with the basic list configuration information.

5. The data crawling apparatus of claim 4,
wherein the emulator module is configured to:
when the detailed information request is input from the user device, determine whether the detailed information is stored in the database;
when the detailed information is stored in the database, provide the detailed information stored in the database to the user device through the user target site;
when the detailed information is not stored in the database, cause an application program to be executed on the user device to access the target web site; and
collect the detailed information from the target web site by means of the application program running on the user device, and store the detailed information in the database.

6. The data crawling apparatus of claim 3,
wherein the detailed information includes a detailed image of the product related to the basic list configuration information, and product size, weight, material, color, and additional description information.

7. The data crawling apparatus of claim 3,
wherein the emulator module is configured to:
determine whether a preference rank of a user target product for which the detailed information request is input, among the products included in the target web site, satisfies a preset ranking condition;
when the preference rank of the user target product satisfies the preset ranking condition, collect detailed information of the user target product from the target web site through the user device; and
when the preference rank of the user target product does not satisfy the preset ranking condition, not collect the detailed information of the user target product through the user device.

8. The data crawling apparatus of any one of claims 1 to 7,
wherein the crawling module comprises:
a crawling target analyzer configured to analyze an arrangement pattern of the basic list configuration information for each of the various target web sites, and to designate a crawling target block and the attributes and fields of the crawling target block based on the arrangement pattern; and
a crawler configured to crawl the basic list configuration information from the various target web sites based on the crawling target block and the attributes and fields of the crawling target block,
and wherein the user target site creation module is configured to integrate product data on various products collected from the various target web sites into a designated object pattern and display it on the user target site.

9. A data crawling method using a web archive, comprising:
crawling, by a crawling module, basic list configuration information, including a title, a summary, and a creation time, from various web sites;
collecting, by a web archive module, past data of the web sites from a web archive site that collects and records changes to the web sites over time;
storing the basic list configuration information collected from the web sites and the past data collected from the web archive site in a database; and
generating, by a user target site creation module, from the basic list configuration information collected from the various web sites and the past data collected from the web archive site in relation to the various web sites, a user target site in which the basic list configuration information and the past data are integrated.

10. The method of claim 9,
further comprising, when a detailed information request is input from a user device at the user target site, collecting, by an emulator module, detailed information related to the basic list configuration information from a target web site related to the detailed information request through the user device, and storing the detailed information in the database.

11. The method of claim 10,
wherein collecting the detailed information and storing it in the database comprises:
causing the user device to view the detailed information on the target web site;
receiving, from the user device, the detailed information collected by the user device viewing the target web site; and
storing the detailed information received from the user device in the database in association with the basic list configuration information.

12. The method of claim 11,
wherein collecting the detailed information and storing it in the database comprises:
when the detailed information request is input from the user device, determining whether the detailed information is stored in the database;
when the detailed information is stored in the database, providing the detailed information stored in the database to the user device through the user target site;
when the detailed information is not stored in the database, causing an application program to be executed on the user device to access the target web site; and
collecting the detailed information from the target web site by means of the application program running on the user device, and storing the detailed information in the database.

13. The method of claim 10,
wherein collecting the detailed information and storing it in the database comprises:
determining whether a preference rank of a user target product for which the detailed information request is input, among the products included in the target web site, satisfies a preset ranking condition; and
when the preference rank of the user target product satisfies the preset ranking condition, collecting detailed information of the user target product from the target web site through the user device,
wherein, when the preference rank of the user target product does not satisfy the preset ranking condition, the detailed information of the user target product is not collected through the user device.

14. The method of claim 9,
wherein crawling the basic list configuration information comprises:
analyzing an arrangement pattern of the basic list configuration information for each of the various target web sites, and designating a crawling target block and the attributes and fields of the crawling target block based on the arrangement pattern; and
crawling the basic list configuration information from the various target web sites based on the crawling target block and the attributes and fields of the crawling target block,
and wherein generating the user target site comprises integrating product data on various products collected from the various target web sites into a designated object pattern and displaying it on the user target site.

15. A computer-readable recording medium on which a program for executing the data crawling method using a web archive according to any one of claims 9 to 14 is recorded.