KR20200022289A

KR20200022289A - System and method for scraping based on web browser

Info

Publication number: KR20200022289A
Application number: KR1020180098245A
Authority: KR
Inventors: 박영준
Original assignee: 주식회사 핑거
Priority date: 2018-08-22
Filing date: 2018-08-22
Publication date: 2020-03-03
Also published as: KR102179792B1; WO2020040556A1

Abstract

The present invention relates to a web browser-based scraping system and method. The system includes: a web browser that is installed on a client device, stores a user certificate for accessing a target server, and receives a user's scraping request; a scraping engine that receives the scraping request from the web browser; a relay server that generates a request message satisfying the target server's requirements for scraping; a plurality of security gateways, each with a different IP, that receive the request message and the scraping request from the relay server, access and scrape the target server from an IP different from that of the relay server, and deliver the scraped data to the relay server; and a scraping management server (SMS) that manages the operating states and IPs of the security gateways. When a security gateway starts, it transmits its ready-to-use state and its own IP information to the scraping management server. When the relay server receives the scraping request from the web browser, it asks the scraping management server for the IP of a security gateway. The scraping management server selects one of the security gateways and delivers the IP of the selected gateway to the relay server. The scraping engine delivers scraping request information, including a service script and the client's certificate information, to the relay server.

Description

Web browser-based scraping system and method {SYSTEM AND METHOD FOR SCRAPING BASED ON WEB BROWSER}

The present invention relates to scraping, and more particularly to a web browser-based scraping system and method.

Scraping is a technology or program that automatically connects to a networked Internet system, renders the data on a screen, and then extracts only the data that is needed. Information collected from a website or program can be stored in another program or database and queried or used whenever necessary, and the stored data can be compared and analyzed to generate new data.

Fields that use this scraping technology include integrated account management, which consolidates financial assets scattered across financial institutions so that transactions such as inquiries and transfers can be managed in one place, and integrated e-mail lookup, which lets a user of several webmail services check all mail at once.

Conventionally, scraping has been performed in two ways. The first is PC scraping, in which a separate scraping module is installed on the PC and scraping is performed through that module to obtain the result. In PC scraping, the certificate is stored on the PC. The second is mobile app scraping, in which a mobile app sends the information and receives the result. In mobile app scraping, the certificate is stored on the mobile device.

PC scraping requires that the scraping module be installed on the PC and that the certificate also be stored on the PC. To scrape from a mobile device (for example, a smartphone), a separate scraping app must be installed on the device and the certificate must also be stored there, which is inconvenient.

Korean Patent Registration No. 10-1815235 (2017.12.28)

The present invention was devised to solve the problems described above. Its object is to provide a web browser-based scraping system and method that offers a flexible, standard web-based service free of operating system constraints, so that the varied demands and changes of each customer can be reflected, and that can be applied easily without installing anything separate in the customer's system, such as an app, a PC module, or a server.

To achieve the above object, a web browser-based scraping system according to the present invention includes: a web browser that is installed on a client device, stores a user certificate for accessing a target server holding the information to be scraped, and accepts a user's scraping request; a scraping engine that receives the scraping request from the web browser; a relay server (WSGS) that generates information (a request message) satisfying the target server's requirements for scraping; a plurality of security gateways (SG), each with a different IP, that receive the scraping request and the request message from the relay server, access and scrape the target server from an IP different from that of the relay server, and deliver the scraped data to the relay server; and a scraping management server (SMS) that manages the IPs and operating states of the security gateways. When a security gateway starts, it transmits its IP information and ready-to-use state to the scraping management server. When the relay server receives the scraping request from the web browser, it asks the scraping management server for the IP of a security gateway. The scraping management server selects one of the security gateways and delivers the IP of the selected gateway to the relay server. The scraping engine delivers scraping request information, including a service script and the client's certificate information, to the relay server.

The web browser-based scraping system according to the present invention may further include a client management server (SSLS) that manages the service scripts required for scraping and the clients' scraping license information. When the web browser receives a scraping request from a client, it requests a service script from the client management server (SSLS) through the scraping engine; the client management server verifies the client's scraping license and, if the client is a legitimate user, delivers the service script to the scraping engine. The relay server, the security gateways, the scraping management server, and the client management server are provided on a cloud web service platform, and the IPs of the security gateways are provided by that platform. The security gateways provide asynchronous connections, provide the same level of security in end-to-end communication as a direct connection to the target server, and, during SSL (Secure Socket Layer) communication, keep the traffic as ciphertext without decrypting it in the middle of the session.

To achieve the above object, a web browser-based scraping method according to the present invention includes the steps of: the web browser, upon receiving from a client a scraping request for a target server, delivering the client's scraping request and user certificate to the scraping engine; the scraping engine transmitting the service script for the scraping and the user certificate to the relay server; the relay server composing information (a request message) that satisfies the target server's requirements for scraping, and transmitting the user certificate, the service script, and the request message to a security gateway using a privacy-preserving protocol (SSL); the security gateway accessing and scraping the target server, from an IP different from that of the relay server, using the request message, the user certificate information, and the service script; and the web browser receiving the scraped information by way of the security gateway, the relay server, and the scraping engine. Each time the relay server accesses the target server, it is assigned a security gateway with a different IP and accesses the target server through the assigned gateway.

The web browser-based scraping method according to the present invention may further include the steps of: the scraping engine, upon receiving the client's scraping request from the web browser, requesting a service script from the client management server; and the client management server verifying the client's scraping license and, if the client is legitimate, providing the service script to the scraping engine.

The web browser-based scraping method according to the present invention further includes the steps of: the security gateway, upon starting, transmitting its IP and a ready-to-use message to the scraping management server; the scraping management server storing and managing the IP and ready state of the security gateway; the relay server, upon receiving the client's scraping request from the scraping engine, requesting a security gateway IP from the scraping management server; and the scraping management server checking the operating states of the security gateways and transmitting the IP of a ready gateway to the relay server.

The security gateway provides asynchronous connections, provides the same level of security in end-to-end communication as a direct connection to the target server, and, during SSL (Secure Socket Layer) communication, keeps the traffic as ciphertext without decrypting it in the middle of the session.

To achieve the above object, a web browser-based scraping system according to another aspect of the present invention includes: a web browser that is installed on a client device, stores a user certificate for accessing a target server holding the information to be scraped, and accepts a user's scraping request; a scraping engine that receives the scraping request from the web browser; a relay server (WSGS) that generates information (a request message) satisfying the target server's requirements for scraping; a plurality of security gateways (SG), each with a different IP, that receive the scraping request and the request message from the relay server, access and scrape the target server from an IP different from that of the relay server, and deliver the scraped data to the relay server; and a customer server that manages the IPs and operating states of the security gateways. When a security gateway starts, it transmits its IP information and ready-to-use state to the customer server. When the relay server receives the scraping request from the web browser, it asks the customer server for the IP of a security gateway. The customer server selects one of the security gateways that is in the ready state and delivers the IP of the selected gateway to the relay server. The scraping engine delivers scraping request information, including a service script and the client's certificate information, to the relay server.

With the web browser-based scraping system and method according to the present invention, the service can be used from a customer's service app without restriction, and it runs on any type of terminal as long as the browser supports HTML5.

In addition, because the invention is written in a single language, it can be used simply by applying a script, with no separate app installation or update, which makes maintenance convenient.

Target institutions and data types can also be added or changed freely, and changes can be applied without extra work such as installing a module, so the service is easy to extend.

FIG. 1 is a block diagram of an embodiment of the configuration of a web browser-based scraping system according to the present invention.
FIG. 2 is a timing diagram of an embodiment of a web browser-based scraping method according to the present invention.
FIG. 3 is a block diagram of another embodiment of the configuration of a web browser-based scraping system according to the present invention.
FIGS. 4 and 5 are overall configuration diagrams of a first embodiment of a web browser-based scraping system according to the present invention.
FIGS. 6 and 7 are overall configuration diagrams of a second embodiment of a web browser-based scraping system according to the present invention.
FIG. 8 is a block diagram of the configuration of an HTML5-based client.
FIG. 9 compares a Secure Gateway (SG) with a proxy in terms of end-to-end (E2E) encryption.
FIG. 10 shows the communication procedure among the relay server (WSGS), the security gateway (SG), and the scraping management server (SMS) during scraping.

Hereinafter, preferred embodiments of the present invention are described in detail with reference to the accompanying drawings. The embodiments described in this specification and the configurations shown in the drawings are merely preferred embodiments of the present invention and do not represent its entire technical idea; it should therefore be understood that various equivalents and modifications capable of replacing them may exist at the time of filing.

FIG. 1 is a block diagram of an embodiment of the configuration of a web browser-based scraping system according to the present invention. This embodiment includes a web browser (110), a scraping engine (120), a relay server (WSGS, 130), a plurality of security gateways (140, 145), and a scraping management server (SMS, 150). It may further include a client management server (SSLS, 160).

The web browser (110) is installed on a client device (not shown), stores a user certificate for accessing the target servers (170, 175) that hold the information to be scraped, and accepts scraping requests from the client. When the web browser (110) receives a scraping request from the client, it requests a service script from the client management server (SSLS, 160) through the scraping engine (120).

The scraping engine (120) receives the scraping request from the web browser (110).

The relay server (WSGS, 130) generates information (a request message) that satisfies the requirements of the target server (170) for scraping. When the relay server (130) receives a scraping request from the web browser (110), it asks the scraping management server (150) for the IP of a security gateway (SG).

The security gateways (SG, 140, 145), each of which has a different IP, receive the scraping request and the request message from the relay server (130), access and scrape the target server (170) from an IP different from that of the relay server (130), and deliver the scraped data to the relay server (130). When the security gateway (140) starts, it transmits its IP information and ready-to-use state to the scraping management server (150). The IPs of the security gateways (140, 145) may be provided by a cloud web service platform. The security gateways (140, 145) also provide asynchronous connections, provide the same level of security in end-to-end communication as a direct connection to the target server (170), and, during SSL (Secure Socket Layer) communication, can keep the traffic as ciphertext without decrypting it in the middle of the session.

The scraping management server (SMS, 150) manages the IPs and operating states of the security gateways (140, 145), and stores the IP information and ready-to-use state it receives from the security gateway (140). When the relay server (130) requests the IP of a security gateway (140, 145), the scraping management server (150) selects one of the security gateways and delivers the IP of the selected gateway to the relay server (130). The scraping management server (SMS, 150) may reassign security gateway IPs according to the scraping requests and instance usage (the number of calls per SG). The scraping engine (120) delivers scraping request information, including the service script and the client's certificate information, to the relay server (130).
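The bookkeeping described for the scraping management server can be illustrated with a minimal Python sketch. The class and method names, the example IPs, and the "fewest calls first" selection rule are assumptions made for illustration; the specification only states that the server stores each gateway's IP and ready state and may reassign IPs according to calls per SG.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class GatewayRecord:
    ip: str
    ready: bool = True
    calls: int = 0  # number of scraping calls routed to this gateway


@dataclass
class ScrapingManagementServer:
    gateways: Dict[str, GatewayRecord] = field(default_factory=dict)

    def register(self, ip: str) -> None:
        """Record a gateway that reports its IP and ready state at start-up."""
        self.gateways[ip] = GatewayRecord(ip=ip)

    def set_ready(self, ip: str, ready: bool) -> None:
        """Update the operating state of a known gateway."""
        self.gateways[ip].ready = ready

    def allocate(self) -> str:
        """Return the IP of the ready gateway with the fewest calls so far.

        The least-used rule is an assumed policy, not one the text prescribes.
        """
        candidates = [g for g in self.gateways.values() if g.ready]
        if not candidates:
            raise RuntimeError("no security gateway is ready")
        chosen = min(candidates, key=lambda g: g.calls)
        chosen.calls += 1
        return chosen.ip
```

In this sketch the relay server would call `allocate()` once per scraping request, so consecutive requests tend to leave from different gateway IPs.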

The client management server (SSLS, 160) manages the service scripts required for scraping and the clients' scraping license information. The client management server (160) verifies the client's scraping license and, if the client is a legitimate user, delivers the service script to the scraping engine (120).
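As a rough sketch of this license check, the following Python fragment returns a service script only to a client whose license is valid. The class name, the in-memory dictionaries, and the client and target identifiers are hypothetical; the specification does not say how licenses or scripts are stored or how verification is performed.

```python
from typing import Dict, Optional


class ClientManagementServer:
    def __init__(self, licenses: Dict[str, bool], scripts: Dict[str, str]):
        self._licenses = licenses  # client id -> whether the license is valid
        self._scripts = scripts    # target name -> service script source

    def request_script(self, client_id: str, target: str) -> Optional[str]:
        """Return the service script for a target, or None if not permitted."""
        # Unknown clients are treated the same as clients with invalid licenses.
        if not self._licenses.get(client_id, False):
            return None
        return self._scripts.get(target)
```

Returning `None` for both an invalid license and an unknown target keeps the sketch short; a real service would likely distinguish the two cases.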

The relay server (130), the security gateways (140, 145), the scraping management server (150), and the client management server (160) may be provided by a cloud web service platform, for example a cloud.

FIG. 2 is a timing diagram of an embodiment of a web browser-based scraping method according to the present invention. When the web browser (110) receives a scraping request for the target server (170) from the client (105) (step S200), it delivers the client's scraping request and user certificate to the scraping engine (120) (step S205). When the scraping engine (120) receives the client's scraping request from the web browser (110), it requests a service script from the client management server (160) (step S210). The client management server (160) verifies the client's scraping license (step S215) and, if the client is legitimate, provides the service script to the scraping engine (120) (step S220).

The scraping engine (120) transmits the service script for the scraping and the user certificate to the relay server (130) (step S225).

Meanwhile, when the security gateway (140) starts, it transmits its IP and a ready-to-use message to the scraping management server (150) (step S212). The scraping management server (150) stores and manages the IP and ready state of the security gateway (140).

When the relay server (130) receives the client's scraping request from the scraping engine (120), it may request a security gateway IP from the scraping management server (150) (step S230). The scraping management server (150) may transmit the IP of a security gateway (140) that is ready for use to the relay server (130) (step S235). For example, the scraping management server (SMS, 150) may reassign security gateway IPs according to the scraping requests and instance usage (the number of calls per SG).

If the relay server (130) already knows the IP of the security gateway (140), steps S230 and S235 may be omitted.

The relay server (130) composes information (a request message) that satisfies the requirements of the target server (170) for scraping, and transmits the user certificate, the service script, and the request message to the security gateway (140) using a privacy-preserving protocol (SSL) (step S240). The security gateway (140) accesses and scrapes the target server (170), from an IP different from that of the relay server (130), using the request message, the user certificate information, and the service script (step S245). Each time the relay server (130) accesses the target server (170), it is assigned a security gateway (140) with a different IP and accesses the target server (170) through the assigned gateway.

The web browser (110) receives the scraped information by way of the security gateway (140), the relay server (130), and the scraping engine (120) (steps S250, S255, and S260).
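The request path of steps S225 through S255 can be sketched in Python as follows. This is an illustration under stated assumptions, not the patented implementation: all class and function names are hypothetical, the `fetch` callable stands in for the real access to the target server, and round-robin rotation stands in for asking the scraping management server for a gateway IP (S230/S235).

```python
from dataclasses import dataclass
from itertools import cycle
from typing import Callable, Dict, List


@dataclass
class ScrapeRequest:
    target: str       # name of the target server
    certificate: str  # stands in for the user certificate
    script: str       # service script obtained from the SSLS


class SecurityGateway:
    def __init__(self, ip: str, fetch: Callable[[str], str]):
        self.ip = ip
        self._fetch = fetch  # hypothetical function that contacts the target

    def scrape(self, message: Dict[str, str]) -> Dict[str, str]:
        # S245: access the target from this gateway's own IP.
        return {"via": self.ip, "data": self._fetch(message["target"])}


class RelayServer:
    def __init__(self, gateways: List[SecurityGateway]):
        # Assumed policy: rotate through gateways so each access uses another IP.
        self._rotation = cycle(gateways)

    def handle(self, request: ScrapeRequest) -> Dict[str, str]:
        # S240: compose the message the target server expects.
        message = {
            "target": request.target,
            "certificate": request.certificate,
            "script": request.script,
        }
        gateway = next(self._rotation)
        return gateway.scrape(message)  # S250: result comes back to the relay


def scraping_engine(relay: RelayServer, request: ScrapeRequest) -> str:
    # S225 and S255: forward the request and pass the data back to the browser.
    return relay.handle(request)["data"]
```

With two gateways registered, two consecutive calls to `RelayServer.handle` report different `via` addresses, which mirrors the statement that each access to the target server goes through a gateway with a different IP.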

The security gateway (140) provides asynchronous connections, provides the same level of security in end-to-end communication as a direct connection to the target server (170), and, during SSL (Secure Socket Layer) communication, keeps the traffic as ciphertext without decrypting it in the middle of the session. The relay server (130), the security gateways (140, 145), the scraping management server (150), and the client management server (160) may be provided by a cloud web service platform, and the IP of the security gateway (140) may be provided by that platform.
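One way to read "keeps the traffic as ciphertext without decrypting it in the middle of the session" is a byte-level relay that forwards the encrypted stream unchanged instead of terminating SSL itself. The Python sketch below shows that idea with `asyncio` streams. It is an interpretation, not the gateway's documented design, and the function names `pump` and `relay` are hypothetical.

```python
import asyncio


async def pump(reader: asyncio.StreamReader, writer, chunk_size: int = 4096) -> int:
    """Copy bytes from reader to writer unchanged and return the byte count.

    The payload is never parsed or decrypted, so an SSL/TLS session that
    passes through stays opaque to this relay.
    """
    total = 0
    while True:
        chunk = await reader.read(chunk_size)
        if not chunk:  # an empty read means the peer closed its side
            break
        writer.write(chunk)
        await writer.drain()
        total += len(chunk)
    writer.close()
    return total


async def relay(client_reader, client_writer, target_host: str, target_port: int) -> None:
    """Join a client connection to the target and forward both directions."""
    target_reader, target_writer = await asyncio.open_connection(target_host, target_port)
    await asyncio.gather(
        pump(client_reader, target_writer),
        pump(target_reader, client_writer),
    )
```

Because the two `pump` coroutines run concurrently under `asyncio.gather`, neither direction blocks the other, which is one plausible meaning of the asynchronous connection mentioned in the text.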

FIG. 3 is a block diagram of another embodiment of the configuration of a web browser-based scraping system according to the present invention. This embodiment includes a web browser (310), a scraping engine (320), a relay server (WSGS, 330), a plurality of security gateways (340, 345), and a customer server (350). It may further include a client management server (SSLS, 360).

The web browser (310) is installed on a client device (not shown), stores a user certificate for accessing the target servers (370, 375) that hold the information to be scraped, and accepts scraping requests from the client. When the web browser (310) receives a scraping request from the client (user), it requests a service script from the client management server (SSLS, 360) through the scraping engine (320).

The scraping engine (320) receives the scraping request from the web browser (310). The relay server (WSGS, 330) generates information (a request message) that satisfies the requirements of the target server (370) for scraping. When the relay server (330) receives a scraping request from the web browser (310), it asks the customer server (350) for the IP of a security gateway.

The security gateways (SG, 340, 345), each of which has a different IP, receive the scraping request and the request message from the relay server (330), access and scrape the target server (370) from an IP different from that of the relay server (330), and deliver the scraped data to the relay server (330). When the security gateway (340) starts, it transmits its IP information and ready-to-use state to the customer server (350). The IPs of the security gateways (340, 345) may be provided by a cloud web service platform. The security gateways (340, 345) also provide asynchronous connections, provide the same level of security in end-to-end communication as a direct connection to the target server (370), and, during SSL (Secure Socket Layer) communication, can keep the traffic as ciphertext without decrypting it in the middle of the session.

The customer server (350) manages the IPs and operating states of the security gateways (340, 345), and stores the IP information and ready-to-use state it receives from the security gateway (340). When the relay server (330) requests the IP of a security gateway (340, 345), the customer server (350) selects one of the security gateways and delivers the IP of the selected gateway to the relay server (330). The customer server (350) may reassign security gateway IPs according to the scraping requests and instance usage (the number of calls per SG). The scraping engine (320) delivers scraping request information, including the service script and the client's certificate information, to the relay server (330).

The client management server (SSLS) 360 manages the service scripts required for scraping and the clients' scraping license information. The client management server 360 verifies the client's scraping license through the customer server 350 and, if the client is a legitimate user, delivers the service script to the scraping engine 320.
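A minimal sketch of this license gate, assuming the license check reduces to a lookup: the function name `fetch_service_script`, the `LicenseError` exception, and the sample license key and script body are hypothetical and do not come from the source.

```python
class LicenseError(Exception):
    """Raised when a scraping license cannot be verified."""


def fetch_service_script(service_name, license_key, valid_licenses, scripts):
    # valid_licenses stands in for the check made through the customer server;
    # scripts stands in for the script store of the client management server.
    if license_key not in valid_licenses:
        raise LicenseError(f"license not recognized for {service_name!r}")
    return scripts[service_name]


VALID = {"LIC-DEMO-001"}                        # hypothetical license key
SCRIPTS = {"tax-summary": "open; login; read"}  # hypothetical script body

script = fetch_service_script("tax-summary", "LIC-DEMO-001", VALID, SCRIPTS)

try:
    fetch_service_script("tax-summary", "LIC-UNKNOWN", VALID, SCRIPTS)
    rejected = False
except LicenseError:
    rejected = True
```

The script is released only after verification succeeds, which matches the order of operations in the paragraph above.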

The relay server 330, the security gateways 340 and 345, and the customer server 350 may be installed in the customer's IDC (Internet Data Center) 300.

Meanwhile, in the present invention, scraping (login, certificate) is carried out through a web browser installed on a mobile device such as a smartphone; the server of the target institution is scraped and the result is received. Target institutions may include, for example, the National Tax Service, the National Health Insurance Service, cash receipt services, real estate information services, telecommunications carriers, and SNS. The web browser used in the present invention may be an HTML5-based browser such as Chrome or Safari, the development language may be JavaScript, and no app needs to be installed.

The terms used in the present invention are briefly described in Table 1.

In Table 1, the WSGS is a communication relay server that initiates E2E (end-to-end encryption) with the target institution, that is, the target server.

FIGS. 4 and 5 show a first embodiment of the overall configuration of the web browser-based scraping system according to the present invention, built using a cloud. In FIGS. 4 and 5, Finger is the name of the company that manages the scraping service. The present invention protects personal information using SSL (Secure Socket Layer), a protocol for keeping personal information confidential. Gateways SG1 to SGn with different IPs are deployed so that the target institution's server sees requests arriving from a variety of IPs. Each SG (Secure Gateway) exchanges data with the target institution by sending HTTP requests (GET/POST methods) and receiving HTTP responses (HTML files).

Referring to FIG. 5, the WSGS, SMS, SSLS, SG, and SMDB are provided by the cloud. When an SG starts, it calls the SMS Update API to report its IP information and that it is ready for use. The SMS then updates the SG information in the SMDB. When the client requests communication, the WSGS asks the SMS, through the SG Search API, which SG it should send the request to. The WSGS composes a message matching the client's communication request and communicates with the target institution's server via the selected SG.
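The message composition step can be illustrated with Python's standard `urllib`. This sketch assumes the message is an ordinary HTTP GET or POST request; the helper name `compose_message`, the example URLs, and the parameters are hypothetical. It builds the request without sending it and does not show the routing through the selected SG.

```python
from urllib.parse import urlencode
from urllib.request import Request


def compose_message(method, url, params, headers=None):
    """Build (but do not send) an HTTP request for the target server."""
    method = method.upper()
    query = urlencode(params)
    if method == "GET":
        # GET carries the parameters in the query string.
        full_url = f"{url}?{query}" if query else url
        return Request(full_url, headers=headers or {}, method="GET")
    if method == "POST":
        # POST carries the parameters as a form-encoded body.
        post_headers = {"Content-Type": "application/x-www-form-urlencoded"}
        post_headers.update(headers or {})
        return Request(url, data=query.encode("ascii"),
                       headers=post_headers, method="POST")
    raise ValueError(f"unsupported method: {method}")


get_req = compose_message("get", "https://target.example/lookup", {"year": 2018})
post_req = compose_message("post", "https://target.example/login", {"user": "demo"})
```

Keeping composition separate from transport means the same message could be handed to whichever gateway is selected for that request.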

The SMS allocates IPs as follows. The SMS manages the IP and status of each SG, and reassigns SG IPs according to scraping requests and instance usage (number of calls per SG).

The Client (user) requests scraping from the Scraping Library (scraping engine) on the service screen. The Scraping Library requests a service script from the SSLS, supplying the service name, license key, and other information. The SSLS verifies the license key through the SMS (scraping management server). If the license check identifies a legitimate user, the SMS delivers the script to the client. The Scraping Library analyzes the script, performs the scraping service, and receives the scraping result. It then assembles the result and passes it to the service screen through the web browser.

The Client (administrator) is an administrator page that shows server information, customer information, success rates, and other data. In the Client (developer), a scraping developer develops scripts and uploads them to the SSLS.

FIGS. 6 and 7 show a second embodiment of the overall configuration of the web browser-based scraping system according to the present invention, built using a cloud. In FIGS. 6 and 7, Finger is the name of the company that manages the scraping service. The present invention protects personal information using SSL (Secure Socket Layer), a protocol for keeping personal information confidential. Gateways SG1 to SGn with different IPs are deployed so that the target institution's server sees requests arriving from a variety of IPs. Each SG (Secure Gateway) exchanges data with the target institution by sending HTTP requests (GET/POST methods) and receiving HTTP responses (HTML files).

Referring to FIG. 7, an SG in the customer IDC calls the Update API of the customer server at startup to report its IP information and that it is ready for use. The customer server updates the SG information in the customer DB. When the client requests communication, the WSGS asks the customer server, through the SG Search API, which SG it should send the request to. The WSGS composes a message matching the client's communication request and communicates with the target institution (server) via the selected SG.

The customer server allocates SG IPs as follows. The customer server manages the IP and status of each SG, and reassigns SG IPs according to scraping requests and instance usage (a maximum of 150 calls per SG). The SSLS, which belongs to the Finger server, verifies licenses and delivers scripts. The SMS handles log collection and status reports. The Client (user) requests scraping from the Scraping Library on the service screen. The Scraping Library requests a service script from the SSLS, supplying the service name, license key, and other information. The SSLS verifies the license key through the SMS (scraping management server). If the SMS's license check identifies a legitimate user, the SSLS delivers the script to the client. The Scraping Library (scraping engine) analyzes the script, performs the scraping service, assembles the scraping result, and passes it to the service screen of the web browser.
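One possible reading of the 150-call limit is sketched below: once a gateway has served 150 calls on its current IP, it is given a fresh one. The text does not spell out the reassignment mechanics, so the `UsageTracker` class, the spare-IP pool, and the recycling of old IPs are assumptions made for illustration.

```python
MAX_CALLS_PER_SG = 150  # figure taken from the text above


class UsageTracker:
    """Hypothetical per-gateway call counter kept by the customer server."""

    def __init__(self, ip_pool):
        self.ip_pool = list(ip_pool)  # spare IPs (assumed to be available)
        self.current_ip = {}          # gateway name -> IP now in use
        self.calls = {}               # gateway name -> calls made on that IP

    def attach(self, gateway):
        # Give the gateway the next spare IP and restart its counter.
        self.current_ip[gateway] = self.ip_pool.pop(0)
        self.calls[gateway] = 0

    def record_call(self, gateway):
        # Count one call; at the cap, recycle the old IP and attach a new one.
        # Returns the IP the gateway should use for its next call.
        self.calls[gateway] += 1
        if self.calls[gateway] >= MAX_CALLS_PER_SG:
            self.ip_pool.append(self.current_ip[gateway])
            self.attach(gateway)
        return self.current_ip[gateway]


tracker = UsageTracker(["198.51.100.1", "198.51.100.2"])
tracker.attach("SG1")
ips_seen = [tracker.record_call("SG1") for _ in range(150)]
```

After the 150th recorded call the tracker switches SG1 to the second IP and resets its counter to zero.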

The Client (administrator) manages the administrator page, which shows server information, customer information, success rates, and other data. In the Client (developer), a scraping developer develops scripts and uploads them to the SSLS.

FIG. 8 is a block diagram of an HTML5-based client. Referring to FIG. 8, the customer's page builds the scraping request values and calls the Standard API of the Scraping Library. Of the input values received through the Standard API, the service type, the customer license, and similar fields are sent to the client management server (SSLS). Once the client management server has verified the license, the service script is returned and executed in the scraping engine. While running the service script, external libraries such as Crypto, PKI, Net, and Common, or native functions, can be used as needed. Depending on the customer's requirements, the collected result can be handed directly to the browser or sent to the customer's server through the Customizing I/F.
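The two delivery paths for the collected result can be pictured as a simple dispatch. The function name `deliver_result`, the `mode` switch, and the callback parameters are hypothetical; the source states only that the result goes either to the browser or, through the Customizing I/F, to the customer's server.

```python
def deliver_result(result, mode, show_in_browser, send_to_customer_server):
    # mode is a hypothetical switch chosen per customer: "browser" or "server".
    if mode == "browser":
        return show_in_browser(result)
    if mode == "server":
        # Stands in for the Customizing I/F path to the customer's server.
        return send_to_customer_server(result)
    raise ValueError(f"unknown delivery mode: {mode}")


shown, sent = [], []  # record what each path received
deliver_result({"items": 3}, "browser", shown.append, sent.append)
deliver_result({"items": 5}, "server", shown.append, sent.append)
```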

FIG. 9 compares the Secure Gateway (SG) with a proxy in terms of end-to-end (E2E) encryption. The Secure Gateway provides an asynchronous connection, so that communication between endpoints has the same level of security as a direct connection to the server. In particular, during SSL (TLS) communication it provides a complete E2E connection without decrypting in the middle of the session. In other words, a proxy decrypts the traffic into plaintext, whereas the Secure Gateway keeps it in ciphertext and can therefore protect personal information more reliably.
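The difference can be pictured as a relay that copies bytes without ever interpreting them. The sketch below uses in-memory streams in place of network sockets, and the function name `relay_opaque` and the sample bytes are hypothetical. It shows only the pass-through idea, not the gateway's actual implementation.

```python
import io


def relay_opaque(inbound, outbound, chunk_size=4096):
    """Copy bytes from inbound to outbound without interpreting them."""
    total = 0
    while True:
        chunk = inbound.read(chunk_size)
        if not chunk:
            break
        outbound.write(chunk)  # forwarded exactly as received
        total += len(chunk)
    return total


# Arbitrary bytes standing in for encrypted traffic; the relay holds no key.
ciphertext = b"\x8a\x1f\x02\xc4\x77\x10\x9e\x3b\x55\xd0"
client_side = io.BytesIO(ciphertext)
target_side = io.BytesIO()
forwarded = relay_opaque(client_side, target_side)
```

Because the relay never needs a decryption key, the data leaving it is byte-for-byte what the client produced. A decrypting proxy would instead hold the plaintext at the midpoint.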

Table 2 compares the proxy server and the Secure Gateway and lists their advantages and disadvantages.

FIG. 10 shows the communication procedure among the relay server (WSGS), the security gateway (SG), and the scraping management server (SMS) during scraping. Referring to FIG. 10, when the relay server (WSGS) 1010 asks the scraping management server (SMS) 1020 for the IP of a security gateway (SG) 1030 that will relay the communication, the scraping management server 1020 provides the SG's IP to the relay server 1010. The relay server 1010 uses the assigned IP to forward the scraping request to the SG 1030, and the SG 1030 receives the data scraped from the target server (not shown). The SMS 1020 checks the usage of the security gateway 1030 and refers to it the next time it assigns an IP.
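The three-party exchange can be sketched as below. It assumes the SMS uses the recorded usage to prefer the least-used gateway, which is one plausible reading of "refers to it the next time it assigns an IP". The function names, the example IPs, and the canned gateway responses are hypothetical.

```python
def least_used(usage):
    # usage: gateway IP -> calls served so far; ties go to the first entry.
    return min(usage, key=usage.get)


def relay_request(request, usage, gateways):
    # 1. The relay server asks the management server for a gateway IP.
    ip = least_used(usage)
    # 2. The relay server forwards the scraping request to that gateway.
    data = gateways[ip](request)
    # 3. The management server notes the usage for the next assignment.
    usage[ip] += 1
    return ip, data


# Each gateway is modeled as a callable returning canned "scraped" data.
gateways = {
    "192.0.2.1": lambda req: f"SG-A:{req}",
    "192.0.2.2": lambda req: f"SG-B:{req}",
}
usage = {"192.0.2.1": 0, "192.0.2.2": 0}
trace = [relay_request(f"req{i}", usage, gateways) for i in range(3)]
```

Under this assumed policy the three requests go out through the first, second, and first gateway in turn.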

The present invention can be implemented as computer-readable code on a computer-readable recording medium, where "computer" includes any device with an information processing function. Computer-readable recording media include every kind of recording device that stores data readable by a computer system. Examples include ROM, RAM, CD-ROM, magnetic tape, floppy disks, and optical data storage devices. In this specification, a "unit" may be a hardware component such as a processor or circuit, and/or a software component executed by a hardware component such as a processor.

Although the present invention has been described with reference to the embodiments shown in the drawings, these are merely illustrative, and those of ordinary skill in the art will understand that various modifications and equivalent embodiments are possible. The true scope of technical protection of the present invention should therefore be determined by the technical spirit of the appended claims.

100: cloud service
105: client
110: web browser
120: scraping engine
130: relay server (WSGS)
140: first security gateway (SG1)
145: second security gateway (SG2)
150: scraping management server (SMS)
160: client management server (SSLS)
170: target server 1
175: target server 2
310: web browser
320: scraping engine
330: relay server (WSGS)
340: first security gateway (SG1)
345: second security gateway (SG2)
350: customer server
360: client management server (SSLS)
370: target server 1
375: target server 2
1010: relay server (WSGS)
1020: scraping management server (SMS)
1030: security gateway (SG)

Claims

A web browser-based scraping system comprising:
a web browser installed in a client device, holding a user certificate for accessing a target server that has the information to be scraped, and receiving a scraping request from the user;
a scraping engine that receives the scraping request from the web browser;
a relay server (WSGS) that generates information (a message) satisfying the requirements of the target server for scraping;
a plurality of security gateways (SG), each having a different IP, that receive the scraping request and the message from the relay server, access and scrape the target server with an IP different from that of the relay server, and pass the scraped data to the relay server; and
a scraping management server (SMS) that manages the IPs and operating states of the plurality of security gateways,
wherein each security gateway sends its IP information and ready-to-use state to the scraping management server on startup, the relay server requests a security gateway IP from the scraping management server on receiving a scraping request from the web browser, the scraping management server selects one of the plurality of security gateways and passes the IP of the selected security gateway to the relay server, and the scraping engine delivers scraping request information, including a service script and the client's certificate information, to the relay server.

The system of claim 1,
further comprising a client management server (SSLS) that manages the scraping license information of clients and the service scripts required for scraping,
wherein, when the web browser receives a scraping request from the client, the web browser requests a service script from the client management server (SSLS) through the scraping engine, and the client management server verifies the client's scraping license and, if the client is a legitimate user, delivers the service script to the scraping engine.

The system of claim 2,
wherein the relay server, the plurality of security gateways, the scraping management server, and the client management server are provided on a cloud web service platform, and the IPs of the plurality of security gateways are provided by the cloud web service platform.

The system of claim 1, wherein the plurality of security gateways
provide an asynchronous connection, provide the same level of security for end-to-end communication as a direct connection to the target server, and, during SSL (Secure Socket Layer) communication, maintain the ciphertext state without decrypting in the middle of the session.

A web browser-based scraping method comprising:
when a web browser receives from a client a scraping request for a target server, transmitting, by the web browser, the scraping request and the client's user certificate to a scraping engine 120;
sending, by the scraping engine, a service script for the scraping and the user certificate to a relay server;
composing, by the relay server, information (a message) that satisfies the requirements of the target server for scraping, and transmitting the user certificate, the service script, and the message to a security gateway using a personal information protection protocol (SSL);
accessing and scraping the target server, by the security gateway, using the message, the user certificate information, and the service script, through an IP different from the IP of the relay server; and
receiving, by the web browser, the scraped information via the security gateway, the relay server, and the scraping engine,
wherein each time the relay server accesses the target server, a security gateway having a different IP is assigned and the target server is accessed through the assigned security gateway.

The method of claim 5, further comprising:
requesting, by the scraping engine, a service script from a client management server when the scraping request is received from the web browser; and
verifying, by the client management server, the client's scraping license and providing the service script to the scraping engine if the client is legitimate.

The method of claim 5, further comprising:
transmitting, by the security gateway when it is started, its IP and a ready-to-use message to a scraping management server;
storing and managing, by the scraping management server, the IP and the ready-to-use state of the security gateway;
requesting, by the relay server, a security gateway IP from the scraping management server when the relay server receives the client's scraping request from the scraping engine; and
checking, by the scraping management server, the operating state of the security gateways and transmitting the IP of a security gateway that is ready for use to the relay server.

The method of claim 5,
wherein the security gateway provides an asynchronous connection, provides the same level of security for end-to-end communication as a direct connection to the target server, and, during SSL (Secure Socket Layer) communication, maintains the ciphertext state without decrypting in the middle of the session.

The method according to any one of claims 5 to 8,
wherein the relay server, the plurality of security gateways, the scraping management server, and the client management server are provided by a cloud web service platform, and
the IP of the security gateway is provided by the cloud web service platform.

A web browser-based scraping system comprising:
a web browser installed in a client device, holding a user certificate for accessing a target server that has the information to be scraped, and receiving a scraping request from the user;
a scraping engine that receives the scraping request from the web browser;
a relay server (WSGS) that generates information (a message) satisfying the requirements of the target server for scraping;
a plurality of security gateways (SG), each having a different IP, that receive the scraping request and the message from the relay server, access and scrape the target server with an IP different from that of the relay server, and pass the scraped data to the relay server; and
a customer server that manages the IPs and operating states of the plurality of security gateways,
wherein each security gateway sends its IP information and ready-to-use state to the customer server on startup, the relay server requests a security gateway IP from the customer server on receiving a scraping request from the web browser, the customer server selects one of the plurality of security gateways that is in a ready state and passes the IP of the selected security gateway to the relay server, and the scraping engine delivers scraping request information, including a service script and the client's certificate information, to the relay server.