KR102620080B1

KR102620080B1 - Method and apparatus for processing order data

Info

Publication number: KR102620080B1
Application number: KR1020220185940A
Authority: KR
Inventors: 펑페이 수; 하오웬 주; 광유 수; 홍야오 자오
Original assignee: 쿠팡 주식회사
Priority date: 2022-12-27
Filing date: 2022-12-27
Publication date: 2024-01-02

Abstract

The present invention relates to a data verification method for an electronic device, the method comprising: generating statistical data about orders based on order data; generating, based on the order data, baseline data for verifying the statistical data; comparing an incremental value of the baseline data with an incremental value of the statistical data; and verifying the statistical data based on the comparison of the incremental values.

Description

Method and device for processing order data {METHOD AND APPARATUS FOR PROCESSING ORDER DATA}

Embodiments of this specification relate to a method and an apparatus for processing order data.

As Internet use becomes more widespread, the e-commerce market is expanding. In particular, with the spread of infectious diseases, the share of purchases made by visiting offline stores is decreasing, while the share of purchases made through e-commerce using computers or smartphones is increasing rapidly.

E-commerce companies need to collect and generate statistical data, such as order quantity, page views, and item clicks, for analysis and decision-making related to the e-commerce services of entities such as users. To generate statistical data from order data, it is important to analyze all of a user's orders in real time, but this is quite difficult. A user's order history occupies a very large amount of data, and individual orders can be created and canceled at any time, so analyzing order data in real time is difficult. Therefore, there is a need for a method that can generate statistical data by analyzing large volumes of changing data in real time.

In addition, validating statistical data generated in real time through a data pipeline is a very important task. Statistical data is updated in real time, and users will use the updated statistical data each time it is updated. However, if a problem occurs in any part of the data pipeline, the statistical data generated through the pipeline may have data quality problems; for example, newly deployed code implementing the data pipeline may contain a bug. Discovering such problems late may cause further problems in the operation of the e-commerce service, so data quality checks on the statistical data need to be performed periodically. However, dumping and processing aggregated data such as metrics (e.g., number of SDP views, number of clicks, total order amount) takes a long time, which makes it difficult to check the latest data, and data such as users' order quantities may amount to 8 TB, so a method for efficiently performing data quality checks on large amounts of data is required.

A prior art document regarding such a data processing method is Korean Patent Application Publication No. 10-2009-0121754.

Embodiments of the present disclosure are proposed to solve the above-mentioned problems. Their object is to efficiently analyze and verify statistical data for large volumes of data that change in real time, by generating statistical data based on real-time order data and verifying the statistical data by comparing the incremental value of the statistical data with the incremental value of baseline data that is also generated based on the order data.

To achieve the above object, a data verification method for an electronic device according to an embodiment of the present disclosure includes: generating statistical data about orders based on order data; generating, based on the order data, baseline data for verifying the statistical data; comparing an incremental value of the baseline data with an incremental value of the statistical data; and verifying the statistical data based on the comparison of the incremental values.

According to one embodiment, the order data includes an order identifier and an order quantity for one or more orders, and the statistical data includes the sum of the order quantities.

According to one embodiment, generating the statistical data further includes: receiving an order cancellation event; identifying the canceled order quantity from the order cancellation event; and adding the canceled order quantity multiplied by -1 to the statistical data.

According to one embodiment, the statistical data is generated through a data pipeline that guarantees exactly-once processing of the order data.

According to one embodiment, the data pipeline includes a first pipeline that guarantees at-least-once processing and a second pipeline that guarantees exactly-once processing, and the output of the first pipeline is input to the second pipeline.

According to one embodiment, generating the statistical data includes, when R(n+1), representing the (n+1)-th change of the order data, is input to the first pipeline: retrieving R(n), representing the n-th change of the order data, from a cache; multiplying the value of R(n) by -1; transmitting -R(n) and R(n+1) to the second pipeline; and caching R(n+1).

According to one embodiment, generating the statistical data further includes generating the statistical data by aggregating -R(n) and R(n+1) in the second pipeline.

According to one embodiment, R(n) includes an order ID and an order version.
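The R(n) handling described above can be illustrated with a minimal Python sketch. It is not the patented implementation: the names `OrderRecord`, `CounteractionStage`, and `aggregate`, the in-memory dictionary used as the cache, and the sample order values are all assumptions made for illustration.

```python
from dataclasses import dataclass, replace
from typing import Dict, List


@dataclass(frozen=True)
class OrderRecord:
    """Hypothetical shape of R(n): order ID, order version, order quantity."""
    order_id: int
    version: int
    quantity: int


class CounteractionStage:
    """Sketch of the first pipeline: for each change R(n+1), emit -R(n) and R(n+1)."""

    def __init__(self) -> None:
        # Assumed cache: order_id -> most recently seen record R(n).
        self._cache: Dict[int, OrderRecord] = {}

    def process(self, new: OrderRecord) -> List[OrderRecord]:
        out: List[OrderRecord] = []
        prev = self._cache.get(new.order_id)  # retrieve R(n) from the cache
        if prev is not None:
            # Multiply the value of R(n) by -1 to obtain the counteraction -R(n).
            out.append(replace(prev, quantity=-prev.quantity))
        out.append(new)                        # R(n+1)
        self._cache[new.order_id] = new        # cache R(n+1)
        return out


def aggregate(records: List[OrderRecord]) -> int:
    """Sketch of the second pipeline: sum the signed quantities."""
    return sum(r.quantity for r in records)


stage = CounteractionStage()
emitted: List[OrderRecord] = []
changes = [
    OrderRecord(1001, 1, 100),
    OrderRecord(1002, 1, 200),
    OrderRecord(1002, 2, 150),  # partial cancellation: 200 -> 150
]
for change in changes:
    emitted.extend(stage.process(change))

total = aggregate(emitted)  # 100 + 200 - 200 + 150
```

Because each change cancels out the previous version of the same order, the aggregate only ever needs the newly emitted records, not the full order history.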

According to one embodiment, generating the baseline data further includes verifying the baseline data based on the order data.

According to one embodiment, the incremental value is calculated at a period that is set based on the statistical data.

An electronic device for verifying data according to an embodiment of the present disclosure includes: a memory storing at least one instruction; and a processor that, by executing the at least one instruction, generates statistical data about orders based on order data, generates, based on the order data, baseline data for verifying the statistical data, compares an incremental value of the baseline data with an incremental value of the statistical data, and verifies the statistical data based on the comparison of the incremental values.

A non-transitory computer-readable storage medium according to an embodiment of the present disclosure includes a medium configured to store computer-readable instructions, wherein the computer-readable instructions, when executed by a processor, cause the processor to perform a data verification method for an electronic device, the method comprising: generating statistical data about orders based on order data; generating, based on the order data, baseline data for verifying the statistical data; comparing an incremental value of the baseline data with an incremental value of the statistical data; and verifying the statistical data based on the comparison of the incremental values.

According to an embodiment of the present disclosure, the electronic device stores only the data that must be reflected in the statistical data of orders in a first database, such as a counteraction record, and updates the statistical data using the canceled order quantity multiplied by -1 from the first database. Statistical data such as metrics can therefore be calculated and updated in real time without retrieving all data in the existing order history.

In addition, according to an embodiment of the present disclosure, the electronic device performs exception handling so that a changed order status is still correctly reflected in the final statistical data even when order data streamed to the message queue arrives out of order or when a conflict occurs during data processing in the data pipeline. The integrity of real-time data processing can thereby be maintained effectively.

In addition, according to an embodiment of the present disclosure, if the electronic device performed a data quality check at time T, the accuracy of the data up to time T has been proven. To perform a data quality check at time T+1, the electronic device only needs to check the increment between time T and time T+1 in order to prove the accuracy of the data up to time T+1. The electronic device can therefore verify the integrity of the data over all time intervals without an unnecessary increase in computation.
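The incremental check between time T and time T+1 can be sketched as follows. This is an illustrative Python sketch under stated assumptions: the snapshot format (a dictionary from a user key to a metric value), the function names `increment` and `verify_increment`, and the sample values are hypothetical and are not specified by the source.

```python
from typing import Dict, Tuple

Snapshot = Dict[str, int]  # assumed format: user key -> metric value


def increment(prev: Snapshot, curr: Snapshot) -> Snapshot:
    """Per-key change of a metric between two snapshots (time T and time T+1)."""
    keys = set(prev) | set(curr)
    return {k: curr.get(k, 0) - prev.get(k, 0) for k in keys}


def verify_increment(
    stat_prev: Snapshot, stat_curr: Snapshot,
    base_prev: Snapshot, base_curr: Snapshot,
) -> Dict[str, Tuple[int, int]]:
    """Return keys whose statistical increment differs from the baseline increment.

    An empty result means the data between T and T+1 passed the check.
    """
    stat_delta = increment(stat_prev, stat_curr)
    base_delta = increment(base_prev, base_curr)
    keys = set(stat_delta) | set(base_delta)
    return {
        k: (stat_delta.get(k, 0), base_delta.get(k, 0))
        for k in keys
        if stat_delta.get(k, 0) != base_delta.get(k, 0)
    }


# Baseline data at T and T+1 (hypothetical values).
baseline_t = {"user1": 300}
baseline_t1 = {"user1": 250}

# Case 1: the statistical data reflects the same change as the baseline.
mismatches_ok = verify_increment({"user1": 300}, {"user1": 250}, baseline_t, baseline_t1)

# Case 2: the statistical data missed the change of -50.
mismatches_bad = verify_increment({"user1": 300}, {"user1": 300}, baseline_t, baseline_t1)
```

Only the differences between consecutive snapshots are compared, so the cost of each check depends on the size of the change rather than on the size of the full history.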

FIG. 1 is an exemplary diagram schematically illustrating the configuration of an electronic device according to an embodiment of the present disclosure.
FIG. 2 is an exemplary diagram for explaining a method of calculating statistical data about orders according to an embodiment of the present disclosure.
FIGS. 3A to 3C are exemplary diagrams illustrating a real-time data pipeline design for generating statistical data in an electronic device according to an embodiment of the present disclosure.
FIG. 4 shows an exemplary architecture representing data flows associated with a baseline generator and a data quality verifier according to an embodiment of the present disclosure.
FIGS. 5A and 5B are exemplary diagrams showing the operation of a data quality verifier according to an embodiment of the present disclosure.
FIG. 6 is a flowchart showing the flow of a data verification method for an electronic device according to an embodiment of the present disclosure.
FIGS. 7A to 7C are exemplary diagrams for explaining designs for generating statistical data according to a plurality of embodiments of the present disclosure.

Hereinafter, embodiments of the present invention will be described in detail with reference to the accompanying drawings.

In describing the embodiments, descriptions of technical content that is well known in the technical field to which the present invention belongs and that is not directly related to the present invention are omitted. Omitting unnecessary description conveys the gist of the present invention more clearly without obscuring it.

For the same reason, some components in the accompanying drawings are exaggerated, omitted, or shown schematically. In addition, the size of each component does not entirely reflect its actual size. In each drawing, identical or corresponding components are assigned the same reference numerals.

The advantages and features of the present invention, and the methods for achieving them, will become clear from the embodiments described in detail below together with the accompanying drawings. However, the present invention is not limited to the embodiments disclosed below and may be implemented in various different forms. The embodiments are provided only so that the disclosure of the present invention is complete and so that those of ordinary skill in the technical field to which the present invention pertains are fully informed of the scope of the invention; the present invention is defined only by the scope of the claims. Like reference numerals refer to like elements throughout the specification.

It will be understood that each block of the flowchart drawings, and combinations of blocks in the flowchart drawings, can be performed by computer program instructions. These computer program instructions may be loaded onto a processor of a general-purpose computer, a special-purpose computer, or other programmable data processing equipment, so that the instructions executed by the processor of the computer or other programmable data processing equipment create means for performing the functions described in the flowchart block(s). These computer program instructions may also be stored in a computer-usable or computer-readable memory that can direct a computer or other programmable data processing equipment to implement a function in a particular manner, so that the instructions stored in the computer-usable or computer-readable memory can produce an article of manufacture containing instruction means that perform the functions described in the flowchart block(s). The computer program instructions may also be loaded onto a computer or other programmable data processing equipment, so that a series of operational steps is performed on the computer or other programmable data processing equipment to create a computer-executed process, and the instructions that run on the computer or other programmable data processing equipment provide steps for executing the functions described in the flowchart block(s).

In addition, each block may represent a module, a segment, or a portion of code that includes one or more executable instructions for executing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions mentioned in the blocks may occur out of order. For example, two blocks shown in succession may in fact be performed substantially simultaneously, or the blocks may sometimes be performed in reverse order, depending on the corresponding function.

The term "unit" used in the present embodiments refers to software or a hardware component such as an FPGA or an ASIC, and a "unit" performs certain roles. However, a "unit" is not limited to software or hardware. A "unit" may be configured to reside in an addressable storage medium and may be configured to run on one or more processors. Thus, as an example, a "unit" includes components such as software components, object-oriented software components, class components, and task components, as well as processes, functions, attributes, procedures, subroutines, segments of program code, drivers, firmware, microcode, circuitry, data, databases, data structures, tables, arrays, and variables. The functions provided within the components and "units" may be combined into a smaller number of components and "units" or further separated into additional components and "units". In addition, the components and "units" may be implemented to run on one or more CPUs within a device or a secure multimedia card.

FIG. 1 is an exemplary diagram schematically illustrating the configuration of an electronic device according to an embodiment of the present disclosure.

Referring to FIG. 1, the electronic device 100 may include a processor 110 and a memory 120 and may perform a method of processing order information. Only the components related to the present embodiments are shown in the electronic device 100 of FIG. 1. Accordingly, it is obvious to those skilled in the art that the electronic device 100 may further include other general-purpose components in addition to the components shown in FIG. 1.

The processor 110 controls the overall functions for processing order information in the electronic device 100. For example, the processor 110 controls the electronic device 100 as a whole by executing programs stored in the memory 120 of the electronic device 100. The processor 110 may be implemented as a central processing unit (CPU), a graphics processing unit (GPU), an application processor (AP), or the like provided in the electronic device 100, but is not limited thereto.

The memory 120 is hardware that stores various types of data processed within the electronic device 100. The memory 120 may store data that has been processed and data that is to be processed in the electronic device 100. The memory 120 may also store applications, drivers, and the like to be run by the electronic device 100. The memory 120 may include random access memory (RAM) such as dynamic random access memory (DRAM) or static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), CD-ROM, Blu-ray or other optical disk storage, a hard disk drive (HDD), a solid state drive (SSD), or flash memory.

In one embodiment, the electronic device 100 may generate statistical data based on order data. In one embodiment, the statistical data may include a profile or a metric representing characteristics, related to an e-commerce service, of an entity (e.g., a user, a category, or a promotion). For example, the statistical data may include attributes of the entity that can be used directly (e.g., membership sign-up date), aggregated data for specific event types, and predictive metrics. The aggregated data may include behavioral statistics, such as logins, number of clicks, access time, and number of single detail page (SDP) views, and transaction statistics, such as number of orders, gross merchandise volume (GMV), order amount, and total order quantity. A predictive metric may include, for example, a prediction score related to a user's interest in "iPhone". E-commerce companies need to collect and generate such statistical data for analysis and decision-making related to the entity's e-commerce services.

To generate statistical data from order data, it is important to analyze all of a user's orders in real time, but this is quite difficult. A user's order history occupies a very large amount of data, and individual orders can be created and canceled at any time, so analyzing order data in real time is difficult.

Specifically, when a user orders one or more items through an e-commerce service, the service server may generate order data representing the user's order information. In one embodiment, the order data may include an order identifier and an order quantity for one or more orders. After the order data is created, the user may cancel all or some of the one or more items included in the order, and the cancellation must be reflected in the user's order profile together with the order creation date. An example related to the creation and cancellation of a user's orders is described below with reference to FIG. 2.

FIG. 2 is an exemplary diagram for explaining a method of calculating statistical data about orders according to an embodiment of the present disclosure. Referring to FIG. 2, an example of order creation and cancellation is shown through several tables for order details 210 and order quantity metrics 220.

For example, if User 1 placed two orders (1001 and 1002) on October 1, 2022, with order quantities of 100 and 200, respectively, a metric such as "user's order quantity" becomes 300. User 1 then canceled some items in order 1002, amounting to an order quantity of 50, on October 2, 2022, so the "user's order quantity" metric should change to 250. However, because a real-time order message is only the most recent snapshot, it shows that the order quantity of order 1002 is currently 150 but not that it was previously 200, which makes it difficult to calculate the change in order quantity in real time. Calculating the "user's order quantity" metric generally requires accessing all of the user's past order history data and summing the order quantities, but the order history is a very large amount of data, and calculating metrics from it in real time is not easy. For example, if the current order data set is 8 TB in size, it is difficult to calculate user metrics such as a user's number of orders or order quantity in real time.

To address this difficulty, in one embodiment, the electronic device 100 may use a first database to store some records for handling changes in order status, and may calculate order-related metrics incrementally. For example, in calculating the "user's order quantity" metric, if the user canceled a previous order and the canceled order quantity is 200, the electronic device 100 may determine the corresponding counteraction value to be -200. The electronic device 100 only needs to add this counteraction value of -200 to the user's order quantity and does not need to sum all of the user's order quantities, so the amount of data required for the calculation is effectively reduced. Of the streamed real-time data, the electronic device 100 stores only the data that must be reflected in the statistical data of orders in a first database, such as a counteraction record, and updates the statistical data using the canceled order quantity multiplied by -1 from the first database. Metrics can therefore be calculated in real time without retrieving all data in the existing order history.

Likewise, in the example shown in the order quantity metrics 220 of FIG. 2, the "user's order quantity" metric is created as 300 on October 1, 2022, and, because the user canceled an order quantity of 50, the metric is recorded as -50 for October 2, 2022. The electronic device 100 may generate the statistical data by receiving an order cancellation event, identifying the canceled order quantity from the order cancellation event, and adding the canceled order quantity multiplied by -1 to the statistical data.
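The arithmetic of this example can be traced with a short Python sketch. The list-of-tuples metric log and the function names `apply_cancellation` and `current_metric` are hypothetical illustrations; the source does not specify how the metric entries are stored.

```python
from typing import List, Tuple

MetricLog = List[Tuple[str, int]]  # assumed format: (date, signed order quantity)


def apply_cancellation(log: MetricLog, date: str, canceled_quantity: int) -> None:
    """Add the canceled order quantity multiplied by -1 to the metric log."""
    log.append((date, -1 * canceled_quantity))


def current_metric(log: MetricLog) -> int:
    """The metric value is the sum of all signed entries."""
    return sum(value for _, value in log)


# October 1, 2022: orders 1001 (100) and 1002 (200) give a metric of 300.
metric_log: MetricLog = [("2022-10-01", 300)]

# October 2, 2022: a cancellation event for an order quantity of 50 arrives.
apply_cancellation(metric_log, "2022-10-02", 50)

metric_after_cancel = current_metric(metric_log)  # 300 + (-50)
```

The update touches only the new entry of -50; the earlier entry of 300 is reused as is, which is the point of computing the metric incrementally.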

FIGS. 3A to 3C are exemplary diagrams illustrating a real-time data pipeline design for generating statistical data in an electronic device according to an embodiment of the present disclosure.

Referring to FIG. 3A, an end-to-end real-time data pipeline architecture 300 is shown, by which the electronic device 100 generates statistical data such as metrics 340 from order data 310. The electronic device 100 can build a data pipeline such as the architecture 300 to automate the extraction, transformation, combination, verification, and loading of real-time streaming data.

In one embodiment, the statistical data may be generated through a data pipeline that guarantees exactly-once processing of the order data 310. In one embodiment, the data pipeline may include a first pipeline 320 (for example, a streaming ingestion pipeline) that guarantees at-least-once processing and a second pipeline 330 (for example, a real-time aggregation pipeline) that guarantees exactly-once processing, and the output of the first pipeline may be input to the second pipeline. In real-time data processing it is important to ensure that the computed results are accurate, and a data pipeline intended for accurate computation may need to have exactly-once processing semantics built in. Exactly-once processing may mean that every piece of data must be reflected in the final result exactly once. For example, even though a user canceled an order only once, the cancellation log streamed in real time may be delivered ten times, and the electronic device 100 needs to remove the duplicates and reflect only a single cancellation. Meanwhile, because the first pipeline 320 guarantees only at-least-once processing and not exactly-once processing, it may let duplicate data through, and the duplicates may be removed (deduplicated) in the second pipeline 330.
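The deduplication step can be illustrated with a small sketch. It assumes each event carries a stable identifier (`event_id`, a hypothetical field not named in the disclosure) by which duplicates can be recognized; the disclosure does not specify how duplicates are detected.

```python
from typing import Iterable, Iterator, Set, Tuple

# (event_id, quantity_delta); the event_id field is an assumption for illustration.
Event = Tuple[str, int]


def deduplicate(events: Iterable[Event]) -> Iterator[Event]:
    """Yield each event once, dropping repeats of an already-seen event_id."""
    seen: Set[str] = set()
    for event_id, delta in events:
        if event_id in seen:
            continue  # duplicate delivered by the at-least-once stage
        seen.add(event_id)
        yield event_id, delta


# One cancellation of 200 units, delivered ten times by the upstream stage.
stream = [("cancel-order-1", -200)] * 10
total = sum(delta for _, delta in deduplicate(stream))
print(total)
```

Without the filter the ten deliveries would subtract 2000 instead of 200, which is the failure mode the paragraph above describes.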

In one embodiment, to guarantee exactly-once processing, the second pipeline 330 may guarantee at-least-once processing, deduplication, and idempotent-write semantics. At-least-once processing semantics ensure that all data is processed; deduplication ensures that duplicated data in the input is removed; and idempotent writes ensure that reprocessing data does not affect the final output, that is, that the result does not change even if the operation is applied multiple times.

In one embodiment, to guarantee idempotent writes in the second pipeline 330, the electronic device 100 may generate a unique identifier that uniquely represents the set of entity data in a micro batch. Because data in the second pipeline 330 is aggregated using different calculation logic, the original input data identifiers are no longer used. The unique identifier may be set as a combination of an entity ID, a micro-batch ID, and a metric ID. The micro-batch ID may be generated using a digest of the starting offsets of every micro batch, and the metric ID may include a unique identifier of the particular calculation logic. In this way, after a system restart or recovery from a crash, the electronic device 100 reprocesses from the starting offset of the last failed micro batch; the micro-batch ID will be the same as in the previous attempt and the data will be overwritten under the same identifier, so idempotent writes can be guaranteed.
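A minimal sketch of this identifier scheme follows. Several details are assumptions made for illustration only: the starting offsets are modeled as a mapping from partition number to offset, SHA-256 is used as the digest, and an in-memory dictionary (`store`) stands in for the output store.

```python
import hashlib
from typing import Dict, List, Tuple

Key = Tuple[str, str, str]  # (entity ID, micro-batch ID, metric ID)

# Stand-in for the output store; writes under the same key overwrite each other.
store: Dict[Key, int] = {}


def micro_batch_id(starting_offsets: Dict[int, int]) -> str:
    """Digest of the batch's starting offsets (partition -> offset is an assumption)."""
    canonical = ",".join(f"{p}:{o}" for p, o in sorted(starting_offsets.items()))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


def process_batch(starting_offsets: Dict[int, int],
                  deltas: List[Tuple[str, str, int]]) -> None:
    """Aggregate (entity_id, metric_id, delta) rows and write them idempotently."""
    batch_id = micro_batch_id(starting_offsets)
    partial: Dict[Key, int] = {}
    for entity_id, metric_id, delta in deltas:
        key = (entity_id, batch_id, metric_id)
        partial[key] = partial.get(key, 0) + delta
    store.update(partial)  # overwrite, never increment, so replays are harmless


offsets = {0: 120, 1: 98}
batch = [("user-1", "order-quantity", 300), ("user-1", "order-quantity", -50)]
process_batch(offsets, batch)
process_batch(offsets, batch)  # replay of the same micro batch after a simulated crash
print(store)
```

Because the replay starts from the same offsets, it derives the same micro-batch ID and therefore the same key, so the second write replaces the first rather than adding to it.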

In one embodiment, the message queue 350 may include Apache Kafka, which can publish, subscribe to, store, and process the order data 310 streamed in real time. Apache Kafka is a high-volume real-time log processing system and a message queuing system that handles messages using the publish/subscribe paradigm. Apache Kafka is a highly scalable, high-throughput distributed messaging system that provides stable and durable message delivery. Applying the producer-consumer pattern, Kafka uses three system components (producer, consumer, and broker), organizes large volumes of data into topics, and divides each topic into partitions in which the data is stored in order. Kafka delivers the stored data to consumers sequentially so that it can be processed efficiently. Kafka is a solution specialized for high-volume real-time log processing and, in a messaging system whose main purpose is to deliver data safely without loss, can process data with a fault-tolerant, stable architecture and high performance.

In one embodiment, generating the statistical data based on the order data may include, when R(n+1) representing the (n+1)-th change of the order data is input to the first pipeline 320, retrieving R(n) representing the n-th change of the order data from a cache, multiplying the value of R(n) by -1, transmitting -R(n) and R(n+1) to the second pipeline 330, and caching R(n+1). The electronic device 100 may generate the statistical data by aggregating the -R(n) and R(n+1) transmitted from the first pipeline 320 to the second pipeline 330. That is, in the second pipeline 330 the statistical data may be computed as R(0) + [R(1) - R(0)] + ... + [R(n) - R(n-1)] + [R(n+1) - R(n)] = R(n+1).
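The telescoping sum above can be checked with a short sketch. The names `cache` and `ingest` are hypothetical, a dictionary stands in for the cache, and each R(n) is reduced to a single integer value for simplicity.

```python
from typing import Dict, List

# Stand-in for the cache: order_id -> most recently forwarded value R(n).
cache: Dict[str, int] = {}


def ingest(order_id: str, new_value: int) -> List[int]:
    """First-pipeline step: emit -R(n) (if any) and R(n+1), then cache R(n+1)."""
    out: List[int] = []
    previous = cache.get(order_id)
    if previous is not None:
        out.append(-1 * previous)  # counteraction value for the previous change
    out.append(new_value)
    cache[order_id] = new_value
    return out


# Second-pipeline step: simply sum everything that arrives.
aggregate = 0
for value in [300, 250, 120]:  # R(0), R(1), R(2)
    for delta in ingest("order-1", value):
        aggregate += delta
print(aggregate)
```

The intermediate terms cancel pairwise, so the aggregate always equals the latest value R(n+1) without the aggregator ever reading the order history.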

In one embodiment, R(n) may include information on an order ID and an order version. The electronic device 100 may generate a metric by checking the order ID and order version information included in R(n) and identifying the order state (created or canceled) corresponding to the order ID or order version that is to be reflected in the metric.

The following describes exception-handling methods by which a changed order state can be correctly reflected in the final statistical data, without being affected by the problem, when i) the order data streamed to the message queue arrives out of order, or ii) a crash occurs during data processing in the data pipeline.

In one embodiment, when messages in the order source message queue arrive out of order, the electronic device 100 may perform exception handling so that only the most recent version of the order data is reflected. For example, if R(n+k+m) is received before the earlier changes R(n+k) to R(n+k+m-1), then when R(n+k+m) is aggregated in the second pipeline 330, the statistical data may be processed as R(0) + [R(1) - R(0)] + ... + [R(n+k-1) - R(n+k-2)] + [R(n+k+m) - R(n+k-1)] = R(n+k+m). That is, through this aggregation scheme of the second pipeline 330, the electronic device 100 reflects only the most recent order data R(n+k+m) in the statistical data and discards the messages for the earlier order data R(n+k) to R(n+k+m-1), so that the changed order state is ultimately reflected in the statistical data without being affected by the problem.
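One way to realize this behavior is to compare order versions before emitting a counteraction value. The sketch below assumes that a higher order version always denotes a more recent change and that late messages can be dropped outright; the names `latest` and `ingest_versioned` are hypothetical.

```python
from typing import Dict, List, Tuple

# order_id -> (version, value) of the most recent change seen so far.
latest: Dict[str, Tuple[int, int]] = {}


def ingest_versioned(order_id: str, version: int, value: int) -> List[int]:
    """Emit deltas for a change, discarding changes older than the cached one."""
    cached = latest.get(order_id)
    if cached is not None and version <= cached[0]:
        return []  # stale (or repeated) message: discard it
    deltas = [] if cached is None else [-cached[1]]  # counteract the cached value
    deltas.append(value)
    latest[order_id] = (version, value)
    return deltas


# Version 4 overtakes versions 2 and 3 in the queue.
arrivals = [(0, 300), (1, 280), (4, 100), (2, 260), (3, 200)]
aggregate = sum(
    delta
    for version, value in arrivals
    for delta in ingest_versioned("order-1", version, value)
)
print(aggregate)
```

The aggregate ends at the value carried by version 4, matching the telescoped form in which the skipped terms never enter the sum.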

In one embodiment, when a crash occurs during data processing in the data pipeline, the electronic device 100 may perform exception handling so that the changed order state can be reflected in the statistical data without being affected by the crash. In the first pipeline 320, the data processing steps in the normal case may be performed as N1 to N5 below:

N1: Read the source message queue and extract the useful fields.

N2: Filter out disqualified messages.

N3: Convert the data into events.

N4: Send the converted events to the downstream event message queue.

N5: Commit the source message queue offset.

Here, committing the source message queue offset indicates that the next message from the source is ready to be processed.

Meanwhile, for the counteraction record 360, the electronic device 100 may further perform the following additional steps C1 to C4:

C1: Retrieve the previously stored event from the database management system (DBMS).

C2: Calculate the counteraction value from the counteraction record.

C3: Send the calculated counteraction value together with the normal event.

C4: Override the previous event with the new event in the DBMS.

Here, the additional steps C1 and C2 may be combined with N3. Even if a crash occurs at N3 or at a step before N3, the crash may have no effect because the message in the source message queue is read and processed again. The additional step C3 may be performed together with N4.

In one embodiment, the additional step C4 may be performed before N4. That is, the electronic device 100 may override the previous event with the new event in the DBMS (C4) and then send the converted events to the downstream event message queue (N4). Consider the cases in which N4 fails because of a crash. In a first case, the first pipeline 320 crashes after R(n) has been replaced with R(n+1). After recovery, the electronic device 100 will send R(n+1) downstream again, which may cause the second pipeline 330 to compute on the basis of R(n) + R(n+1). In a second case, only one of R(n+1) and -R(n) is sent downstream before the first pipeline 320 crashes. After recovery, there is no problem if -R(n) was the one sent, but if only R(n+1) was sent, the second pipeline 330 may again compute on the basis of R(n) + R(n+1). Consequently, both the first and second cases can cause the electronic device 100 to compute on the basis of R(n) + R(n+1) in the second pipeline 330, which is a problem.

In one embodiment, the additional step C4 may be performed after N4. That is, the electronic device 100 may send the converted events to the downstream event message queue (N4) and then override the previous event with the new event in the DBMS (C4). If N4 partially fails (first case), only one of R(n+1) and -R(n) is sent downstream before the first pipeline 320 crashes. After recovery, both R(n+1) and -R(n) will be sent, and the second pipeline 330 can deduplicate the record it had already received. If N4 succeeds but C4 fails (second case), only the duplicate records R(n+1) and -R(n) will be resent downstream. If N4 and C4 succeed but N5 fails (third case), only the duplicate record R(n+1) will be resent downstream. In each case, sending duplicate data may cause no problem, because the duplicates are removed in the second pipeline 330, which guarantees exactly-once processing. Therefore, the electronic device 100 may preferably perform the additional step C4 after N4.
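The three crash cases for the "N4 before C4" ordering can be simulated with a small sketch. Everything beyond the step labels is an assumption for illustration: events are given string identifiers so that the downstream stage can deduplicate them, a dictionary stands in for the DBMS, and `crash_after` is a hypothetical switch that injects a failure at a chosen point.

```python
from typing import Dict, List, Optional, Set, Tuple


class Crash(Exception):
    """Simulated failure of the first pipeline."""


class Downstream:
    """Stand-in for the second pipeline: deduplicate by event id, then sum."""

    def __init__(self) -> None:
        self.seen: Set[str] = set()
        self.total = 0

    def receive(self, event_id: str, delta: int) -> None:
        if event_id not in self.seen:
            self.seen.add(event_id)
            self.total += delta


def process(db: Dict[str, Tuple[int, int]], downstream: Downstream, order_id: str,
            version: int, value: int, crash_after: Optional[str] = None) -> None:
    previous = db.get(order_id)                              # C1
    events: List[Tuple[str, int]] = []
    if previous is not None and previous[0] < version:       # C2
        events.append((f"{order_id}:v{previous[0]}:undo", -previous[1]))
    events.append((f"{order_id}:v{version}", value))
    for i, (event_id, delta) in enumerate(events):           # N4 together with C3
        downstream.receive(event_id, delta)
        if crash_after == "partial_send" and i == 0:
            raise Crash()
    if crash_after == "send":
        raise Crash()
    db[order_id] = (version, value)                          # C4, after N4
    if crash_after == "override":
        raise Crash()
    # N5 (offset commit) would happen here; until then the message is replayed.


def run(crash_point: str) -> int:
    db: Dict[str, Tuple[int, int]] = {}
    downstream = Downstream()
    process(db, downstream, "order-1", 0, 300)
    try:
        process(db, downstream, "order-1", 1, 250, crash_after=crash_point)
    except Crash:
        process(db, downstream, "order-1", 1, 250)  # replay after recovery
    return downstream.total


results = {point: run(point) for point in ("partial_send", "send", "override")}
print(results)
```

In all three cases the replay only produces events the downstream stage has already seen or still needs, so the total converges to the latest value; the sketch does not model the "C4 before N4" ordering, in which the stored previous value is lost before its counteraction is sent.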

In one embodiment, the DBMS may include Apache Cassandra, an open-source distributed NoSQL DBMS. Cassandra can manage large volumes of data across many servers while providing high performance with no single point of failure (SPOF), supports clusters spanning multiple data centers, and allows low-latency operation for all clients.

In one embodiment, the DBMS may include Redis, an open-source non-relational DBMS with a key-value structure.

In this way, even when the order data streamed to the message queue arrives out of order or a crash occurs during data processing in the data pipeline, the electronic device 100 performs exception handling that allows the changed order state to be correctly reflected in the final statistical data without being affected by the problem, and can thereby effectively maintain the integrity of real-time data processing.

Referring to FIG. 3B, the real-time data pipeline architecture 300 is shown together with data source metadata 371 and metric metadata 372 that may be input to the architecture 300. In one embodiment, the electronic device 100 may register a data source through the data source metadata 371 and specify which business fields can be extracted from that data source. In one embodiment, through the metric metadata 372, the electronic device 100 may specify the required metrics and set information on the operations and conditions for each metric (for example, the type of service for which statistics are collected, and the start and end dates of statistics collection).

Referring to FIG. 3C, the real-time data pipeline architecture 300 is shown together with exemplary information illustrating the process (381 to 385) by which a metric is generated. At 381, when a new source order event arrives in the message queue 350, the first pipeline 320 may read the event and forward it according to the configuration in the data source metadata 371. At 382, the required order events are processed, and the required business fields may be extracted and written to the relevant business message queue topic according to the configuration in the metric metadata 372. At 383 and 384, the second pipeline 330 may aggregate the required metrics at a fine-grained level according to the metric metadata 372. At 385, a publisher may obtain the final metrics 340 and push them to the order profile message queue topic.

Meanwhile, validating the statistical data generated in real time through the data pipeline is a very important task. The statistical data is updated in real time, and users will use the updated statistical data each time it is updated. However, if a problem arises in any part of the overall data pipeline, the statistical data generated through the pipeline may have data quality problems; for example, code newly deployed to implement the data pipeline may contain a bug. Discovering such problems late may cause further problems in operating the e-commerce service, so data quality checks on the statistical data need to be performed periodically to prevent this. However, dumping and processing aggregated data such as metrics (for example, the number of SDP views, the number of clicks, or the total order amount) takes a long time, which makes it difficult to check the latest data, and data such as users' order quantities may amount to 8 TB, so a method for efficiently performing data quality checks on large volumes of data is required. Embodiments related to statistical data validation are described below with reference to FIGS. 4 and 5.

FIG. 4 shows an exemplary architecture illustrating the data flows associated with a baseline generator and a data quality validator according to an embodiment of the present disclosure. Referring to FIG. 4, an architecture 410 including the data pipeline for generating statistical data described with reference to FIG. 3A and an architecture 420 for validating the data quality of the statistical data are shown.

In one embodiment, the electronic device 100 may generate, based on the order data, baseline data for validating the statistical data. The electronic device 100 may generate the baseline data from the source-of-truth data (golden truth data) of the order data, and the source-of-truth data may be extracted from a DBMS dump used by a batch pipeline and a message queue dump used by the real-time pipeline. In one embodiment, the source-of-truth data may be sampled based on user identifiers to reduce computational cost.
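The disclosure does not say how the sampling by user identifier is performed. One common approach, shown here purely as an assumption, is to hash the identifier into a bucket so that the same users are selected on every run and in both the baseline and the metric dump; `in_sample` and `sample_percent` are hypothetical names.

```python
import hashlib


def in_sample(user_id: str, sample_percent: int = 10) -> bool:
    """Deterministically select roughly sample_percent% of users by hashed id."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # stable bucket in [0, 100)
    return bucket < sample_percent


# Keep only the records of sampled users before building the baseline.
records = [("user-1", 300), ("user-2", 120), ("user-3", 45)]
sampled = [(uid, qty) for uid, qty in records if in_sample(uid)]
print(sampled)
```

A deterministic rule matters here because the baseline and the statistical data must be compared over exactly the same set of users.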

In one embodiment, generating the baseline data may further include validating the baseline data based on the order data. For example, data may also be duplicated in the message queue dump used by the real-time pipeline, or a crash may occur while the source-of-truth data is being extracted from the message queue dump, so the source-of-truth data cannot be guaranteed to be 100% accurate; the electronic device 100 may therefore also perform data quality validation on the baseline data generated from the order data.

In one embodiment, the electronic device 100 may compare the incremental values of the baseline data and of the statistical data and validate the statistical data based on the comparison. For example, the baseline data and the statistical data may each include an order quantity, and the incremental value of each may be computed as the order quantity up to today, "value(till today)", minus the order quantity up to yesterday, "value(till yesterday)", that is, "value(till today) - value(till yesterday)". In this way, taking the data value confirmed at time T as the reference, the electronic device 100 tracks the incremental values of the data per time unit between T and T+1 and verifies the accuracy of the tracked incremental values, and can thereby determine whether the data value observed at time T+1 is accurate. This allows large volumes of data to be validated efficiently without running the validation procedure over all data from time 0 to time T+1.

Specifically, V(n) denotes the computed metric value at time n, B(n) denotes the baseline metric value at time n, and P(n) denotes the correctness of V(n) and takes one of the values true or false. That is, when P(n) is true, V(n) = B(n). The time unit of n may be an hour. The electronic device 100 may then prove that P(n) is true at every time n in the following order.

1) First, P(0) = true, where 0 denotes the time at which data validation starts; this indicates that the entire history of the computed metric will be validated from the start.

2) Prove that, for all k>=n+1, if P(k) is true then P(k+1) is also true. For a particular k, assuming that P(k) is true when n=k, compare B(j) and V(j) for k<j<=k+1; if all values in this interval are equal, P(k+1) is also true.

3) If 1) and 2) above are proven to be true, then by mathematical induction P(n) is true at every time n.

By this mathematical induction, if the electronic device has performed a data quality check at time T, the accuracy of the data up to time T has been established; to perform a data quality check at time T+1, it is sufficient to check only the incremental values between time T and time T+1 in order to establish the accuracy of the data up to time T+1. Accordingly, the electronic device 100 can verify the integrity of the data over all time intervals without an unnecessary increase in the amount of computation.
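The induction argument can be turned into a simple check. In this sketch, `computed` and `baseline` are lists holding V(0), V(1), ... and B(0), B(1), ... (assumed to have equal length), and the function name `first_invalid_time` is hypothetical. Each step compares only the increments between consecutive times, as described above.

```python
from typing import List, Optional


def first_invalid_time(computed: List[int], baseline: List[int]) -> Optional[int]:
    """Return the first time n at which P(n) fails, or None if P(n) holds throughout."""
    # Base case: P(0) must hold at the time validation starts.
    if computed[0] != baseline[0]:
        return 0
    # Inductive step: given P(k), P(k+1) holds if the increments over (k, k+1] agree.
    for k in range(len(computed) - 1):
        increment_v = computed[k + 1] - computed[k]
        increment_b = baseline[k + 1] - baseline[k]
        if increment_v != increment_b:
            return k + 1
    return None


V = [100, 130, 180, 180]
B = [100, 130, 175, 175]
print(first_invalid_time(V, B))
print(first_invalid_time(B, B))
```

In a running system only the latest increment would be checked at each period, since earlier periods have already been validated; the loop here replays that process over a stored history.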

In one embodiment, the incremental value may be computed at a period set based on the statistical data. For example, the period at which the incremental value is computed (for example, one hour, one day, or one week) may be determined adaptively according to the statistical data (for example, the type of item ordered or the user's ordering frequency). For statistical data on users' order quantities, depending on the characteristics of the item, the period may be set short (for example, one day) for the order quantity of items that users order frequently and long (for example, one week) for the order quantity of items that users order infrequently, which can efficiently reduce the amount of computation required to validate the statistical data.

FIGS. 5A and 5B are exemplary diagrams illustrating the operation of a data quality validator according to an embodiment of the present disclosure.

Referring to FIG. 5A, an architecture 500 including a data quality validator 510 is shown. The data quality validator 510 may include the data quality validator (DQ Validator) included in the architecture 420 of FIG. 4. The data quality validator 510 may measure the accuracy of the data by comparing the baseline data generated by the baseline generator with the statistical data obtained from the metric dump. In one embodiment, the data quality validator 510 may be run as a Spark job.

Referring to FIG. 5B, code 520 that can check whether constraints set by a user are satisfied is shown. In the example shown in FIG. 5B, the data quality validator 510, through the code 520, sets as the key the statistical data to be checked, namely the fresh order amount for one day ("order-fresh-day-amount"), and checks as constraints whether the values are within a set range (for example, 0 to 100000), whether the size differs greatly from the previous day's order amount (for example, whether it is within 0.9 to 1.1 times the previous day's order amount), and how closely the data match (for example, a match rate of 99%). If, as a result of comparing the statistical data and the baseline data (or their incremental values), at least one of the constraints is not satisfied, the data quality validator 510 may send a notification message to an administrator terminal.
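The code 520 of FIG. 5B is not reproduced in the text, so the following is only a sketch of how the three kinds of constraint could be evaluated. All names (`Constraints`, `check`, and the field names) are hypothetical, and the interpretation of the inputs as per-record values alongside a previous-day total is an assumption.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Constraints:
    key: str
    min_value: float
    max_value: float
    min_ratio_to_previous: float
    max_ratio_to_previous: float
    min_match_rate: float


def check(c: Constraints, values: Sequence[float], baseline: Sequence[float],
          previous_total: float) -> List[str]:
    """Return the names of the violated constraints (empty list means all pass)."""
    violations: List[str] = []
    # Range constraint: every value must lie within [min_value, max_value].
    if any(not (c.min_value <= v <= c.max_value) for v in values):
        violations.append("range")
    # Size constraint: today's total relative to the previous day's total.
    if previous_total > 0:
        ratio = sum(values) / previous_total
        if not (c.min_ratio_to_previous <= ratio <= c.max_ratio_to_previous):
            violations.append("size")
    # Match constraint: share of values equal to the corresponding baseline value.
    matches = sum(1 for v, b in zip(values, baseline) if v == b)
    match_rate = matches / len(values) if values else 1.0
    if match_rate < c.min_match_rate:
        violations.append("match_rate")
    return violations


constraints = Constraints("order-fresh-day-amount", 0, 100000, 0.9, 1.1, 0.99)
violations = check(constraints, [10, 20, 30, 40], [10, 20, 30, 41], previous_total=100)
print(violations)
```

A non-empty result would correspond to the condition under which the validator notifies the administrator terminal.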

FIG. 6 is a flowchart illustrating a data verification method of an electronic device according to an embodiment of the present disclosure. The steps shown in FIG. 6 may be performed by the electronic device 100 shown in FIG. 1, and descriptions that overlap with the foregoing may be omitted below.

In step S610, the electronic device may generate statistical data on orders based on order data. In one embodiment, the statistical data may include a profile or metric representing characteristics of an entity (for example, a user, a category, or a promotion) related to an e-commerce service, and may include, for example, attributes of the entity that can be used directly (for example, the membership sign-up date), aggregated data for specific event types, and predictive metrics. The aggregated data may include behavioral statistics such as logins, number of clicks, access time, and number of single detail page (SDP) views, and transaction statistics such as number of orders, GMV (gross merchandise volume, total sales value), order amount, and total order quantity. A predictive metric may include, for example, a prediction score related to a user's interest in "iPhone". An e-commerce company needs to collect and generate such statistical data for analysis and decision-making related to an entity's e-commerce service.

In step S620, the electronic device may generate, based on the order data, baseline data for validating the statistical data. The electronic device may generate the baseline data from the source-of-truth data of the order data, and the source-of-truth data may be extracted from a DBMS dump used by a batch pipeline and a message queue dump used by the real-time pipeline. In one embodiment, the source-of-truth data may be sampled based on user identifiers to reduce computational cost.

In step S630, the electronic device may compare the incremental values of the baseline data and the statistical data. In step S640, the electronic device may verify the statistical data based on the comparison of the incremental values. For example, the baseline data and the statistical data may each include an order quantity, and the incremental value of each may be calculated by subtracting "value(till yesterday)," the order quantity up to yesterday, from "value(till today)," the order quantity up to today, that is, as "value(till today) - value(till yesterday)". In this way, taking the data value confirmed at time T as a reference, the electronic device 100 tracks the incremental value of the data over each time unit between T and T+1 and verifies the accuracy of the tracked incremental values, thereby determining whether the data value confirmed at time T+1 is accurate. This allows a large volume of data to be verified efficiently, without running a verification procedure over all data from time 0 to time T+1.
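
The incremental comparison of steps S630 and S640 could be sketched as follows. The function names and the `tolerance` parameter are hypothetical, and the sketch assumes that both the statistical data and the baseline data expose cumulative order quantities "till today" and "till yesterday".

```python
def increment(value_till_today: int, value_till_yesterday: int) -> int:
    """Incremental value: value(till today) - value(till yesterday)."""
    return value_till_today - value_till_yesterday


def verify_increment(stat_today: int, stat_yesterday: int,
                     base_today: int, base_yesterday: int,
                     tolerance: int = 0) -> bool:
    """Return True if the two increments agree within the assumed tolerance."""
    stat_inc = increment(stat_today, stat_yesterday)
    base_inc = increment(base_today, base_yesterday)
    return abs(stat_inc - base_inc) <= tolerance
```

For instance, `verify_increment(1250, 1200, 1250, 1200)` compares an increment of 50 against an increment of 50 and passes, whereas a statistical value of 1260 for today would give an increment of 60 and fail. Only one time unit of data is examined per check.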

FIGS. 7A to 7C are exemplary diagrams for explaining designs for generating statistical data according to a plurality of embodiments of the present disclosure.

Referring to FIG. 7A, a data pipeline architecture 710 by which the electronic device 100 generates metrics from a data source according to an embodiment of the present disclosure is shown. Metrics may be divided into today's metrics and metrics from before today. Today's metrics may be calculated directly through a stream processing engine (e.g., a Spark Streaming application), and metrics from before today may be calculated in batch and stored in a key-value store. The final metric may be generated by combining the real-time metric and the batch metric. For source data whose state can change, such as order data, the electronic device 100 may use reactions to handle state changes of a record. As shown in FIG. 7A, a database 711 may be added to record the most recent state record; in one embodiment, the database may include a database managed by the Cassandra database management system, which supports key-value reads and writes. Taking orders as an example, the data pipeline may store every newly processed source order event in the database 711, replacing the previous event if one exists. For every received source order event, if the order is not newly created, the pipeline may obtain the previous record of that order and calculate a reaction value. For example, if an order is cancelled, the pipeline may send the previous order quantity multiplied by -1 downstream.
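
The reaction mechanism described for architecture 710 could be sketched as follows. This is a minimal in-memory sketch under stated assumptions: a plain dictionary stands in for the database 711, and the class name `ReactionPipeline`, the event fields (`order_id`, `user_id`, `quantity`, `status`), and the status value `"CANCELLED"` are hypothetical names chosen for illustration.

```python
from collections import defaultdict


class ReactionPipeline:
    """Keeps the latest state per order and emits reaction values on change."""

    def __init__(self):
        self.latest = {}                # order_id -> last stored quantity (stand-in for database 711)
        self.totals = defaultdict(int)  # user_id -> total order quantity (downstream metric)

    def process(self, event):
        """Process one source order event and return the values sent downstream."""
        order_id = event["order_id"]
        deltas = []
        previous = self.latest.get(order_id)
        if previous is not None:
            # The order is not newly created: cancel out its previous record.
            deltas.append(-previous)
        if event["status"] == "CANCELLED":
            current = 0                 # a cancelled order contributes nothing
        else:
            current = event["quantity"]
            deltas.append(current)
        self.latest[order_id] = current  # replace the previous event
        for delta in deltas:
            self.totals[event["user_id"]] += delta
        return deltas
```

With this sketch, creating an order of quantity 3 sends `[3]` downstream, changing it to quantity 5 sends `[-3, 5]`, and cancelling it sends `[-5]`, so the aggregated total always reflects the latest state of each order.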

Referring to FIG. 7B, a data pipeline architecture 720 by which the electronic device 100 generates metrics from a data source according to an embodiment of the present disclosure is shown. As shown in FIG. 7B, the metric calculation module 721 may first store business event details and then generate the required metrics based on the stored detailed events. The architecture 720 is characterized in that the metric calculation module aggregates the detailed events during a recalculation period. This has the advantage of speeding up metric calculation and reducing storage cost, but the disadvantages of increased implementation complexity and difficulty in adding new metrics beyond the predefined aggregation range.

Referring to FIG. 7C, a data pipeline architecture 730 by which the electronic device 100 generates metrics from a data source according to an embodiment of the present disclosure is shown. As shown in FIG. 7C, the metric calculation module 731 may aggregate business events into a fine-grained, pre-aggregated format and store that data first. The metric calculation module 731 may then send a notification to the real-time publisher 732 so that the real-time publisher 732 aggregates the pre-aggregated data into the final metrics. For source data whose state can change, such as order data, the electronic device 100 may use reactions to handle state changes of a source event. In the architecture 730, detailed business events are stored in the database 733 of the data transformation module rather than in the metric calculation module, because the generated reaction events can then be processed as ordinary business events.
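
The two-stage aggregation of architecture 730 could be sketched as follows. The granularity of the pre-aggregated format is not specified in the disclosure; the (user, hour) bucket, the `hour` field, and the function names `pre_aggregate` and `publish_final` are assumptions made for illustration only.

```python
from collections import defaultdict


def pre_aggregate(events):
    """Stage 1 (metric calculation module): fine-grained (user_id, hour) buckets."""
    buckets = defaultdict(int)
    for e in events:
        # Reaction events arrive as ordinary events carrying negative quantities.
        buckets[(e["user_id"], e["hour"])] += e["quantity"]
    return dict(buckets)


def publish_final(buckets):
    """Stage 2 (real-time publisher): roll the buckets up into final per-user metrics."""
    totals = defaultdict(int)
    for (user_id, _hour), quantity in buckets.items():
        totals[user_id] += quantity
    return dict(totals)
```

Because a reaction event is just another event with a negative quantity, neither stage needs special handling for order changes or cancellations.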

Meanwhile, this specification and the drawings disclose preferred embodiments of the present invention. Although specific terms are used, they are used only in a general sense to explain the technical content of the present invention easily and to aid understanding of the invention, and are not intended to limit the scope of the present invention. It will be obvious to those of ordinary skill in the art to which the present invention pertains that, in addition to the embodiments disclosed herein, other modifications based on the technical idea of the present invention can be implemented.

100: electronic device
110: processor
120: memory

Claims

A method for verifying data in an electronic device, comprising:
generating statistical data about orders based on order data;
generating baseline data for verifying the statistical data based on the order data;
comparing incremental values of each of the baseline data and the statistical data; and
verifying the statistical data based on the comparison of the incremental values,
wherein the statistical data is generated through a data pipeline that guarantees exactly-once processing of the order data,
the data pipeline includes a first pipeline that guarantees at-least-once processing and a second pipeline that guarantees exactly-once processing,
the output of the first pipeline includes duplicate data associated with the order data, and
deduplication of the duplicate data is performed in the second pipeline.

The method of claim 1,
wherein the order data includes an order identifier and an order quantity for each of one or more orders, and
the statistical data includes a total order quantity.

The method of claim 2,
wherein the generating of the statistical data further comprises:
receiving an order cancellation event;
identifying the cancelled order quantity from the order cancellation event; and
adding the cancelled order quantity multiplied by -1 to the statistical data.

(Deleted)

The method of claim 1,
wherein the generating of the statistical data comprises,
when R(n+1) representing the (n+1)-th change of the order data is input to the first pipeline:
retrieving R(n) representing the n-th change of the order data from a cache;
multiplying the value of R(n) by -1;
transmitting -R(n) and R(n+1) to the second pipeline; and
caching R(n+1).

The method of claim 6,
wherein the generating of the statistical data further comprises
generating the statistical data by aggregating -R(n) and R(n+1) in the second pipeline.

The method of claim 6,
wherein R(n) includes an order ID and an order version.

The method of claim 1,
wherein the generating of the baseline data further comprises
verifying the baseline data based on the order data.

The method of claim 1,
wherein the incremental value is calculated at a set cycle based on the statistical data.

An electronic device for verifying data, comprising:
a memory storing at least one instruction; and
a processor configured to, by executing the at least one instruction, generate statistical data about orders based on order data, generate baseline data for verifying the statistical data based on the order data, compare incremental values of each of the baseline data and the statistical data, and verify the statistical data based on the comparison of the incremental values,
wherein the statistical data is generated through a data pipeline that guarantees exactly-once processing of the order data,
the data pipeline includes a first pipeline that guarantees at-least-once processing and a second pipeline that guarantees exactly-once processing,
the output of the first pipeline includes duplicate data associated with the order data, and
deduplication of the duplicate data is performed in the second pipeline.

A non-transitory computer-readable storage medium, comprising:
a medium configured to store computer-readable instructions,
wherein the computer-readable instructions, when executed by a processor, cause the processor to perform a data verification method of an electronic device, the method comprising:
generating statistical data about orders based on order data;
generating baseline data for verifying the statistical data based on the order data;
comparing incremental values of each of the baseline data and the statistical data; and
verifying the statistical data based on the comparison of the incremental values,
wherein the statistical data is generated through a data pipeline that guarantees exactly-once processing of the order data,
the data pipeline includes a first pipeline that guarantees at-least-once processing and a second pipeline that guarantees exactly-once processing,
the output of the first pipeline includes duplicate data associated with the order data, and
deduplication of the duplicate data is performed in the second pipeline.