KR102448694B1

KR102448694B1 - Systems and related methods and devices for predictive data analysis

Info

Publication number: KR102448694B1
Application number: KR1020197014598A
Authority: KR
Inventors: 제레미 아친; 토마스 데고도이; 사비에르 코노트; 세르게이 유르겐슨; 마크 엘. 스테드먼; 글렌 카운드리; 피터 프리튼호퍼; 혼 니안 추아
Original assignee: 데이터로봇, 인크.
Priority date: 2016-10-21
Filing date: 2017-10-21
Publication date: 2022-09-28
Also published as: WO2018075995A1; JP2021012734A; GB2606674B; GB2571651A; GB202211852D0; SG10202104185UA; GB2571651B; GB2606674A; AU2017345796A1; KR20190108559A; GB201907147D0; JP2019537125A; JP7107926B2; EP3529755A1

Abstract

Predictive data analysis techniques may include improved techniques for generating predictive models for time-series prediction problems, improved techniques for determining the predictive value of input variables ("features"), and improved techniques for generating secondary models of primary predictive models.

Description

Systems and related methods and devices for predictive data analysis

<Cross-Reference to Related Applications>

This application is related to U.S. Patent Application No. 15/331,797, titled "Systems and Techniques for Determining the Predictive Value of a Feature," filed on October 21, 2016 under Attorney Docket No. DRB-001C1CP, and to U.S. Provisional Patent Application No. 62/411,526, titled "Systems and Techniques for Predictive Data Analysis," filed on October 21, 2016 under Attorney Docket No. DRB-002PR, each of which is incorporated herein by reference to the maximum extent permitted by applicable law.

<Field of Invention>

The present invention relates generally to systems and techniques for data analysis. Some embodiments relate specifically to systems and techniques for using statistical learning methods to develop, select, and/or understand predictive models for prediction problems.

Many organizations and individuals use electronic data to improve their operations or aid their decision-making. For example, many business enterprises use data management technologies to improve the efficiency of various business processes, such as executing transactions, tracking inputs and outputs, or marketing products. As another example, many businesses use operational data to evaluate the performance of business processes, to measure the effectiveness of efforts to improve processes, or to decide how to adjust processes.

In some cases, electronic data can be used to anticipate problems or opportunities. To build predictive models, some organizations combine operational data describing what happened in the past with evaluation data describing subsequent values of performance metrics. Based on the outcomes predicted by the predictive models, organizations can make decisions, adjust processes, or take other actions. For example, an insurance company might seek to build a predictive model that more accurately forecasts future claims, or a predictive model that predicts when policyholders are considering switching to competing insurers. An automobile manufacturer might seek to build a predictive model that more accurately forecasts demand for new car models. A fire department might seek to build a predictive model that forecasts days with high fire danger, or that predicts which structures are endangered by a fire.

Machine-learning techniques (e.g., supervised statistical-learning techniques) may be used to generate a predictive model from a data set that includes previously recorded observations of at least two variables. The variable(s) to be predicted may be referred to as "target(s)", "response(s)", or "dependent variable(s)". The remaining variable(s), which can be used to make the predictions, may be referred to as "feature(s)", "predictor(s)", or "independent variable(s)". The observations are generally partitioned into at least one "training" data set and at least one "test" data set. A data analyst then selects a statistical-learning procedure and executes that procedure on the training data set to generate a predictive model. The analyst then tests the generated model on the test data set to determine how well the model predicts the value(s) of the target(s) relative to the actual observations of the target(s).

Motivation for Some Embodiments

Data analysts can use analytic techniques and computational infrastructure to build predictive models from electronic data, including operational and evaluation data. Data analysts generally use one of two approaches to build predictive models. With the first approach, an organization dealing with a prediction problem simply uses a packaged predictive modeling solution already developed for the same prediction problem or a similar one. This "cookie-cutter" approach is inexpensive, but it is generally viable only for a small number of prediction problems that are common to a relatively large number of organizations (e.g., fraud detection, defect management, marketing response, etc.). With the second approach, a team of data analysts builds a customized predictive modeling solution for a prediction problem. This "artisanal" approach is generally expensive and time-consuming, and therefore tends to be used for a small number of high-value prediction problems.

The space of potential predictive modeling solutions for a prediction problem is generally large and complex. Statistical learning techniques are influenced by many academic traditions (e.g., mathematics, statistics, physics, engineering, economics, sociology, biology, medicine, artificial intelligence, data mining, etc.) and by applications in many areas of commerce (e.g., finance, insurance, retail, manufacturing, healthcare, etc.). Consequently, there are many different predictive modeling algorithms, which may have many variants and/or tuning parameters, as well as different pre-processing and post-processing steps with their own variants and/or parameters. The number of potential predictive modeling solutions (e.g., combinations of pre-processing steps, modeling algorithms, and post-processing steps) is already quite large and is increasing rapidly as researchers develop new techniques.

Given this vast space of predictive modeling techniques, the artisanal approach to generating predictive models tends to be time-consuming and to leave large portions of the modeling search space unexplored. Analysts tend to explore the modeling space in an ad hoc fashion, based on their intuition or previous experience and on extensive trial-and-error testing. They may not pursue some potentially useful avenues of exploration, or may not adjust their searches properly in response to the results of their initial efforts. Furthermore, the scope of the trial-and-error testing tends to be limited by constraints on the analysts' time, so the artisanal approach generally explores only a small portion of the modeling search space.

The artisanal approach can also be very expensive. Developing a predictive model through the artisanal approach often entails a substantial investment in computing resources and in well-paid data analysts. Given these substantial costs, organizations often forego the artisanal approach in favor of the cookie-cutter approach, which can be less expensive but tends to explore only a small portion of the vast predictive modeling space (e.g., a portion of the modeling space that is expected, a priori, to contain acceptable solutions to a particular prediction problem). The cookie-cutter approach can generate predictive models that perform poorly relative to the unexplored options.

There is a need for a tool that systematically and cost-effectively evaluates the space of potential predictive modeling techniques for prediction problems. In many ways, the conventional approaches to generating predictive models are analogous to prospecting for valuable resources (e.g., oil, gold, minerals, jewels, etc.). While prospecting may lead to some valuable discoveries, it is much less efficient than a geologic survey combined with carefully planned exploratory digging or drilling based on an extensive library of previous results. The inventors have recognized and appreciated that statistical learning techniques can be used to systematically and cost-effectively evaluate the space of potential predictive modeling solutions for prediction problems.

Time-Series Predictive Modeling

Many prediction problems involve predicting the values of one or more output variables ("targets") at one or more future times based on the values of one or more input variables ("features") at one or more past times. Such prediction problems may be referred to as "time-series prediction problems", and predictive models that model such problems may be referred to as "time-series predictive models" or "time-series models".

There is a need for techniques for rigorously and efficiently exploring the modeling search space for time-series models. The inventors have recognized and appreciated that rigorous and efficient exploration of the time-series modeling search space (including efficient training, testing, and comparison of time-series models) can be facilitated by explicitly parameterizing certain aspects of the time-series modeling procedure, for example: the amount of training data used to train the model; the time interval between observations of the input variables; the length of the time period covered by the training data; the recency of the time period covered by the training data; the period between the times associated with the feature values provided to the model and the times associated with the target values predicted by the model (the "skip range"); and the period over which the model predicts values of the target (the "forecast range").

In general, one innovative aspect of the subject matter described in this specification can be embodied in a predictive modeling method that includes performing a predictive modeling procedure, the procedure including:
(a) obtaining time-series data including one or more data sets, each data set including a plurality of observations, each observation including (1) an indication of a time associated with the observation and (2) respective values of one or more variables;
(b) determining a time interval of the time-series data;
(c) identifying one or more of the variables as targets, and identifying zero or more other variables as features;
(d) determining a forecast range and a skip range associated with a prediction problem represented by the time-series data, wherein the forecast range indicates the duration of the period for which values of the targets are to be predicted, and the skip range indicates the time lag between the time associated with the earliest prediction in the forecast range and the time associated with the most recent observation on which the predictions in the forecast range are based;
(e) generating training data from the time-series data, wherein the training data include a first subset of the observations of at least one of the data sets, the first subset includes training-input and training-output collections of the observations, the times associated with the observations in the training-input and training-output collections correspond to a training-input time range and a training-output time range, respectively, the skip range separates the end of the training-input time range from the beginning of the training-output time range, and the duration of the training-output time range is at least as long as the forecast range;
(f) generating testing data from the time-series data, wherein the testing data include a second subset of the observations of at least one of the data sets, the second subset includes test-input and test-validation collections of the observations, the times associated with the observations in the test-input and test-validation collections correspond to a test-input time range and a test-validation time range, respectively, the skip range separates the end of the test-input time range from the beginning of the test-validation time range, and the duration of the test-validation time range is at least as long as the forecast range;
(g) fitting a predictive model to the training data; and
(h) testing the fitted model on the testing data.
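Steps (d) through (f) can be illustrated with a minimal sketch. It is not the patented implementation: the function name `split_by_time`, the `(time, values)` tuple layout, and the daily `sales` series are all assumptions made for illustration. The sketch places the output time range so that it starts one skip range after the last input observation and lasts as long as the period for which targets are predicted.

```python
from datetime import datetime, timedelta

def split_by_time(observations, input_end, skip, forecast):
    """Split (time, values) observations into input and output collections.

    The output time range starts `skip` after `input_end` (the time of the
    most recent input observation) and lasts for `forecast`.
    """
    output_start = input_end + skip
    output_end = output_start + forecast
    inputs = [obs for obs in observations if obs[0] <= input_end]
    outputs = [obs for obs in observations if output_start <= obs[0] < output_end]
    return inputs, outputs

# Ten daily observations of one hypothetical variable.
start = datetime(2017, 1, 1)
series = [(start + timedelta(days=d), {"sales": 100 + d}) for d in range(10)]

train_in, train_out = split_by_time(
    series,
    input_end=start + timedelta(days=4),   # inputs cover days 0..4
    skip=timedelta(days=2),                # first predicted time is day 6
    forecast=timedelta(days=3),            # outputs cover days 6..8
)
```

The same function can produce the test-input and test-validation collections by passing a later `input_end`, which keeps the skip range and output duration identical for training and testing.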

Other embodiments of this aspect include corresponding computer systems, apparatus, and computer programs recorded on one or more computer storage devices, each configured to perform the actions of the method. A system of one or more computers can be configured to perform particular actions by virtue of having software, firmware, hardware, or a combination thereof installed on the system that, in operation, causes the system to perform the actions. One or more computer programs can be configured to perform particular actions by virtue of including instructions that, when executed by data processing apparatus, cause the apparatus to perform the actions.

The foregoing and other embodiments can each optionally include one or more of the following features, alone or in combination. In some embodiments, the time interval of the time-series data is determined based, at least in part, on the times associated with at least a subset of the observations included in at least one of the data sets. In some embodiments, determining the time interval of the time-series data includes: for each of the data sets, determining a respective time interval of the data set; determining that the time intervals of the data sets are uniform; and setting the time interval of the time-series data to the time interval of the data sets. In some embodiments, determining the time interval of a data set includes: for one or more pairs of consecutive observations included in the data set, determining a respective time period between the consecutive observations; determining that the time periods between the pairs of consecutive observations are uniform; and setting the time interval of the data set to the time period between the pairs of consecutive observations.
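The uniform case described above can be sketched as follows. The function names `uniform_interval` and `series_interval` are hypothetical, and the sketch assumes each data set is represented simply by a list of observation times.

```python
from datetime import datetime, timedelta

def uniform_interval(times):
    """Return the gap between consecutive observation times if all gaps are equal, else None."""
    ordered = sorted(times)
    gaps = {later - earlier for earlier, later in zip(ordered, ordered[1:])}
    return gaps.pop() if len(gaps) == 1 else None

def series_interval(data_sets):
    """Return the shared interval if every data set has the same uniform interval, else None."""
    intervals = {uniform_interval(times) for times in data_sets}
    return intervals.pop() if len(intervals) == 1 else None

# Two hypothetical hourly data sets and one irregular data set.
hourly_a = [datetime(2017, 1, 1, h) for h in range(4)]
hourly_b = [datetime(2017, 1, 2, h) for h in range(6)]
irregular = [datetime(2017, 1, 1, 0), datetime(2017, 1, 1, 1), datetime(2017, 1, 1, 3)]
```

Returning `None` leaves the non-uniform case to a separate policy, such as choosing the most common gap, which this sketch does not attempt.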

In some embodiments, determining the time interval of the time-series data includes: for each of the data sets, determining a respective time interval of the data set; and determining that the time intervals of at least two of the data sets differ, wherein the time interval of the time-series data is determined based, at least in part, on (1) the respective portions of the observations included in each of the data sets, and/or (2) the respective time intervals of the data sets. In some embodiments, determining the respective time interval of a data set includes: determining a respective time period between each pair of consecutive observations included in the data set; if the time periods between the pairs of consecutive observations exhibit a plurality of non-uniform durations, determining the time interval of the data set based, at least in part, on (1) the respective portions of the pairs of consecutive observations exhibiting each of the non-uniform durations, and/or (2) the durations of the time periods; and if the time periods between the pairs of consecutive observations exhibit a uniform duration, setting the time interval of the data set to that duration. In some embodiments, the time interval of the time-series data is the shortest time interval that is an integer multiple of each of the respective time intervals of the data sets.
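The "shortest time interval that is an integer multiple of each" data set interval is the least common multiple of those intervals. A minimal sketch, assuming the intervals are whole numbers of seconds (the function name `common_interval` is hypothetical):

```python
from datetime import timedelta
from functools import reduce
from math import gcd

def common_interval(intervals):
    """Shortest interval that is an integer multiple of every given interval.

    Assumes each interval is a whole number of seconds.
    """
    seconds = [int(interval.total_seconds()) for interval in intervals]
    lcm = reduce(lambda a, b: a * b // gcd(a, b), seconds)
    return timedelta(seconds=lcm)

# Hypothetical data sets sampled every 10 and every 15 minutes.
shared = common_interval([timedelta(minutes=10), timedelta(minutes=15)])
```

Here `shared` is 30 minutes, so the 10-minute data set would be down-sampled by a factor of three and the 15-minute data set by a factor of two.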

In some embodiments, the actions of the method further include: for each data set, if the time interval of the data set is shorter than the time interval of the time-series data, converting the time interval of the data set to the time interval of the time-series data by down-sampling the observations of the data set. In some embodiments, down-sampling the observations of the data set includes, for each instance of the time interval of the time-series data in the time period corresponding to the data set: identifying all observations in the data set associated with times corresponding to that instance of the time interval; aggregating the identified observations to generate an aggregate observation; and replacing the identified observations in the data set with the aggregate observation. In some embodiments, the number of identified observations corresponding to an instance of the time interval of the time-series data is equal to the ratio between the time interval of the time-series data and the time interval of the data set. In some embodiments, aggregating the identified observations includes setting the value of each variable of the aggregate observation to (1) the corresponding variable value in the earliest of the identified observations, (2) the corresponding variable value in the most recent of the identified observations, (3) the maximum of the corresponding variable values in the identified observations, (4) the minimum of the corresponding variable values in the identified observations, (5) the mean of the corresponding variable values in the identified observations, or (6) the value of a function of the corresponding variable values in the identified observations.
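The down-sampling procedure above can be sketched for a single variable. The names `downsample` and `AGGREGATORS`, the `(time, value)` layout, and the `origin` argument that anchors the interval instances are assumptions for illustration only.

```python
from datetime import datetime, timedelta
from statistics import mean

# Aggregation options (1) to (5) from the text; option (6) is any other callable.
AGGREGATORS = {
    "first": lambda values: values[0],
    "last": lambda values: values[-1],
    "max": max,
    "min": min,
    "mean": mean,
}

def downsample(observations, origin, interval, how="mean"):
    """Replace the observations in each `interval`-long bucket with one aggregate.

    `observations` is a list of (time, value) pairs sorted by time.
    """
    buckets = {}
    for time, value in observations:
        index = (time - origin) // interval      # which instance of the interval
        buckets.setdefault(index, []).append(value)
    aggregate = AGGREGATORS[how]
    return [(origin + index * interval, aggregate(values))
            for index, values in sorted(buckets.items())]

# Eight hypothetical quarter-hourly readings, down-sampled to hourly.
origin = datetime(2017, 1, 1)
readings = [(origin + timedelta(minutes=15 * i), i + 1) for i in range(8)]
hourly_mean = downsample(readings, origin, timedelta(hours=1))
hourly_last = downsample(readings, origin, timedelta(hours=1), how="last")
```

Each hourly bucket holds four readings, which matches the ratio between the one-hour target interval and the 15-minute source interval.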

In some embodiments, the time interval of the time-series data is selected from the group consisting of the time intervals of the data sets. In some embodiments, the data sets include a first data set exhibiting a first time interval and a second data set exhibiting a second time interval longer than the first time interval, the second time interval is selected as the time interval of the time-series data, and the method further includes converting the time interval of the first data set to the time interval of the time-series data by down-sampling the observations of the first data set. In some embodiments, the time interval of the time-series data differs from each of the time intervals of the data sets.

In some embodiments, at least a group of the observations of the time-series data include respective values of a first variable, and the method further includes, prior to fitting the predictive model to the training data and testing the fitted model on the testing data: determining that the values of the first variable include time values; for each observation in the group, generating a respective value of a second variable, wherein the value of the second variable includes an offset between the time value of the first variable and a reference time value; and adding the values of the second variable to the respective observations in the group. In some embodiments, the actions of the method further include removing the values of the first variable from the observations in the group. In some embodiments, the reference time includes a date of an event. In some embodiments, the event includes a birth, a marriage, a graduation from school, a commencement of employment with an employer, or a commencement of work in a particular position.
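A minimal sketch of this offset feature, assuming observations are dictionaries and the offset is measured in whole days. The function name `add_offset_feature`, the keys `review_date` and `days_since_hire`, and the hire date are hypothetical.

```python
from datetime import date

def add_offset_feature(observations, time_key, reference, new_key, drop_original=False):
    """Add `new_key` holding the day offset between each time value and `reference`."""
    rows = []
    for obs in observations:
        row = dict(obs)                                   # leave the input untouched
        row[new_key] = (row[time_key] - reference).days   # offset from the reference time
        if drop_original:
            del row[time_key]                             # optionally remove the raw time value
        rows.append(row)
    return rows

# Hypothetical observations; the reference event is the start of employment.
hire_date = date(2017, 1, 1)
reviews = [
    {"employee": "a", "review_date": date(2017, 3, 1)},
    {"employee": "a", "review_date": date(2017, 1, 31)},
]
with_offsets = add_offset_feature(reviews, "review_date", hire_date, "days_since_hire",
                                  drop_original=True)
```

Setting `drop_original=True` corresponds to the embodiment that removes the first variable's values once the offsets have been added.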

In some embodiments, the variables include a first variable and a second variable, and the actions of the method further include: determining that changes in the values of the first variable and the second variable are correlated, with a time lag between changes in the value of the first variable and the correlated changes in the value of the second variable; and displaying, via a graphical user interface, graphical content indicating the duration of the time lag between the changes in the value of the first variable and the correlated changes in the value of the second variable.
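The text does not say how the lag is found. One common approach, shown here purely as an assumption, is to correlate the period-to-period changes of the two variables at several candidate lags and keep the lag with the strongest Pearson correlation. The function names and sample values are hypothetical.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    denom = sqrt(var_x * var_y)
    return cov / denom if denom else 0.0

def best_lag(first, second, max_lag):
    """Lag (in observations) at which changes in `second` best track changes in `first`."""
    d_first = [b - a for a, b in zip(first, first[1:])]
    d_second = [b - a for a, b in zip(second, second[1:])]
    scores = {
        lag: pearson(d_first[:len(d_first) - lag], d_second[lag:])
        for lag in range(max_lag + 1)
    }
    return max(scores, key=lambda lag: abs(scores[lag]))

# Hypothetical series: `second` repeats `first` two observations later.
first = [1, 3, 2, 5, 4, 6, 8, 7, 9, 12]
second = [0, 0] + first[:-2]
lag = best_lag(first, second, max_lag=4)
```

Multiplying the returned lag by the time interval of the data gives the lag duration that a user interface could display.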

In some embodiments, the forecast range is determined based, at least in part, on (1) the time interval of the time-series data, (2) the number of observations included in the time-series data, (3) the time period corresponding to the time-series data, and/or (4) a natural time period selected from the group consisting of microseconds, milliseconds, seconds, minutes, hours, days, weeks, months, quarters, seasons, years, decades, centuries, and millennia. In some embodiments, the forecast range is an integer multiple of the time interval of the time-series data. In some embodiments, the time period between the times associated with consecutive predictions in the forecast range is equal to the time interval of the time-series data.

In some embodiments, the skip range is determined based, at least in part, on a latency in collection of the time-series data, a latency in communication of the time-series data, a latency in analysis of the time-series data, a latency in communication of the analysis of the time-series data, and/or a latency in implementation of actions based on the analysis of the time-series data.

In some embodiments, the actions of the method further include determining the duration of the training-input time range based, at least in part, on the total number of observations included in the time-series data, the amount of variation in the values of at least one of the variables over time, the amount of periodic variation in the values of at least one of the variables, the consistency of the variation in the values of at least one of the variables across a plurality of time periods, and/or the duration of the forecast range. In some embodiments, fitting the predictive model to the training data includes fitting the predictive model to a subset of the training data corresponding to a portion of the training-input time range, wherein the portion of the training-input time range begins at a time subsequent to the start time of the training-input time range and ends at the end time of the training-input time range. In some embodiments, the duration of the portion of the training-input time range is an integer multiple of the duration of the forecast range.

In some embodiments, the actions of the method further include down-sampling the training data prior to fitting the predictive model to the training data. In some embodiments, down-sampling the training data includes removing from the training data all observations obtained from at least one of the data sets. In some embodiments, down-sampling the training data includes: setting a down-sampled time interval of the training data to an integer multiple of the time interval of the time-series data; and, for each instance of the down-sampled time interval of the training data: identifying all observations of the training data associated with times corresponding to that instance of the down-sampled time interval, aggregating the identified observations to generate an aggregate observation, and replacing the identified observations in the training data with the aggregate observation. In some embodiments, the actions of the method further include down-sampling the testing data prior to testing the fitted model on the testing data.

In some embodiments, the actions of the method further include performing cross-validation of the predictive model. In some embodiments, the training data are first training data, the testing data are first testing data, the fitted model is a first fitted model, and performing cross-validation of the predictive model comprises: (i) generating second training data and second testing data from the time-series data, wherein the second training data include a third subset of the observations of at least one of the data sets, and the second testing data include a fourth subset of the observations of at least one of the data sets; (j) fitting the predictive model to the second training data to obtain a second fitted model; and (k) testing the second fitted model on the second testing data.

In some embodiments, the first subset of the observations corresponds to a sliding training window covering a first range of training times, each observation included in the first subset is associated with a time within the first range of training times, the third subset of the observations corresponds to the sliding training window covering a second range of training times, each observation included in the third subset is associated with a time within the second range of training times, and the earliest time in the first range of training times is earlier than the earliest time in the second range of training times. In some embodiments, the second subset of the observations corresponds to a sliding testing window covering a first range of testing times, each observation included in the second subset is associated with a time within the first range of testing times, the fourth subset of the observations corresponds to the sliding testing window covering a second range of testing times, each observation included in the fourth subset is associated with a time within the second range of testing times, and the earliest time in the first range of testing times is earlier than the earliest time in the second range of testing times. In some embodiments, the first range of testing times partially overlaps the second range of training times. In some embodiments, the second range of testing times does not overlap any portion of the first range of training times and does not overlap any portion of the second range of training times.
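
The sliding training and testing windows described above can be illustrated with a short sketch. It assumes, for simplicity, that times are integer indices, that windows are half-open ranges, and that each testing window begins immediately where its training window ends (the skip range is ignored here). The function and parameter names are hypothetical.

```python
def sliding_windows(n_times, train_len, test_len, step, n_folds):
    """Return a list of (training_range, testing_range) index ranges.

    Each fold slides both windows forward by `step` time indices, so the
    earliest training and testing times increase from fold to fold.
    """
    folds = []
    for k in range(n_folds):
        train_start = k * step
        train_end = train_start + train_len
        test_end = train_end + test_len
        if test_end > n_times:
            break  # not enough data left for another full fold
        folds.append((range(train_start, train_end), range(train_end, test_end)))
    return folds
```

With `sliding_windows(20, 10, 4, 2, 3)`, the first fold trains on times 0 to 9 and tests on times 10 to 13, and the second fold trains on times 2 to 11 and tests on times 12 to 15. The first testing range therefore partially overlaps the second training range, while the second testing range overlaps neither training range.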

In some embodiments, the actions of the method further include partitioning the time-series data into a plurality of partitions including at least a first partition and a second partition. In some embodiments, partitioning the time-series data into the plurality of partitions comprises assigning each of the data sets to a corresponding partition. In some embodiments, partitioning the time-series data into the plurality of partitions comprises temporally partitioning the time-series data, wherein each of the partitions corresponds to a respective portion of a time period associated with the time-series data, and each observation included in the time-series data is assigned to the partition corresponding to the portion of the time period that matches the time associated with the observation.
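
Temporal partitioning, as described above, can be sketched as follows. The example assumes that each observation is a tuple whose first element is its time, and that the partitions are defined by a sorted list of cut times; both the data layout and the function name are assumptions made for illustration.

```python
from bisect import bisect_right


def temporal_partition(observations, boundaries):
    """Assign each observation to the partition whose time span contains it.

    boundaries: sorted cut times. Partition 0 holds times before
    boundaries[0], partition i holds times in
    [boundaries[i - 1], boundaries[i]), and the last partition holds
    times at or after boundaries[-1].
    """
    partitions = [[] for _ in range(len(boundaries) + 1)]
    for obs in observations:
        time = obs[0]
        partitions[bisect_right(boundaries, time)].append(obs)
    return partitions
```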

In some embodiments, the first training data include a subset of the observations included in the first partition of the time-series data; the first testing data include respective subsets of the observations included in all of the partitions of the time-series data other than the first partition; the second training data include a subset of the observations included in the second partition of the time-series data; and the second testing data include respective subsets of the observations included in all of the partitions of the time-series data other than the second partition. In some embodiments, the first partition of the time-series data includes the first and second training data and the first and second testing data, the second partition of the time-series data includes holdout data, and the actions of the method further include testing the first fitted model and the second fitted model on the holdout data. In some embodiments, the predictive model is not fitted to the holdout data.

In some embodiments, the actions of the method further include performing nested cross-validation of the predictive model. In some embodiments, performing nested cross-validation of the predictive model comprises: partitioning the time-series data into a first plurality of partitions including at least a first partition of the time-series data and a second partition of the time-series data; and partitioning the first partition of the time-series data into a plurality of partitions of the first partition, including at least a first partition of the first partition of the time-series data and a second partition of the first partition of the time-series data, wherein the training data include the first partition of the first partition of the time-series data, and the testing data include at least the partitions of the first partition of the time-series data other than the first partition of the first partition of the time-series data.

In some embodiments, the training data are first training data, the testing data are first testing data, the fitted model is a first fitted model, and performing nested cross-validation of the predictive model further comprises: (i) generating, from the first partition of the time-series data, second training data and second testing data, wherein the second training data include the second partition of the first partition of the time-series data, and the second testing data include at least the partitions of the first partition of the data set other than the second partition of the first partition of the time-series data; (j) fitting the predictive model to the second training data to obtain a second fitted model; and (k) testing the second fitted model on the second testing data.

In some embodiments, performing the nested cross-validation further comprises: testing the first fitted model and the second fitted model on the second partition of the time-series data; and comparing the first fitted model with the second fitted model based on the results of testing the first fitted model and the second fitted model on the second partition of the time-series data.

In some embodiments, the actions of the method further include determining, for the fitted model, model-specific predictive values of the one or more features of the time-series data. In some embodiments, the actions of the method further include performing, based at least in part on the model-specific predictive values of the features, at least one action selected from the group consisting of: removing a feature from the time-series data; creating a feature derived from two or more features in the time-series data and adding the derived feature to the time-series data; blending the predictive model with another predictive model; and allocating resources during a process of evaluating the suitability of predictive modeling procedures for the prediction problem.

In some embodiments, the actions of the method further include: determining the suitabilities of a plurality of predictive modeling procedures for the prediction problem based, at least in part, on characteristics of the prediction problem and/or attributes of the respective predictive modeling procedures; selecting one or more predictive modeling procedures from the plurality of predictive modeling procedures based on the determined suitabilities of the selected modeling procedures for the prediction problem; and performing the one or more predictive modeling procedures.

In some embodiments, performing the one or more predictive modeling procedures comprises: transmitting instructions to a plurality of processing nodes, wherein the instructions include a resource allocation schedule allocating resources of the processing nodes for execution of the selected modeling procedures, and the resource allocation schedule is based, at least in part, on the suitabilities of the selected modeling procedures for the prediction problem; receiving results of the execution of the selected modeling procedures by the plurality of processing nodes in accordance with the resource allocation schedule, wherein the results include predictive models generated by the selected modeling procedures and/or scores of the generated models for the time-series data associated with the prediction problem; and selecting, from the generated models, a predictive model for the prediction problem based, at least in part, on the score of the selected predictive model.

In some embodiments, the actions of the method further include blending the fitted model with another fitted model to generate a blended predictive model.

In some embodiments, the actions of the method further include deploying the fitted model. In some embodiments, the time-series data are first time-series data, and deploying the fitted model comprises generating one or more predictions by applying the fitted model to second time-series data representing one or more instances of the prediction problem, wherein the first time-series data do not include the second time-series data. In some embodiments, the time-series data are first time-series data, and deploying the fitted model comprises refreshing the fitted model based, at least in part, on second time-series data. In some embodiments, the fitted model is a first fitted model, and refreshing the fitted model based, at least in part, on the second time-series data comprises: performing the predictive modeling procedure on the second time-series data to generate a second fitted model; and blending the first fitted model and the second fitted model to generate a refreshed predictive model. In some embodiments, refreshing the fitted model based, at least in part, on the second time-series data comprises performing the predictive modeling procedure on third time-series data, which include at least a portion of the first time-series data and at least a portion of the second time-series data, to generate a refreshed predictive model.

In some embodiments, the fitted model is deployed on one or more servers, other fitted models are also deployed on the one or more servers, and prediction requests for the fitted model and the other fitted models are allocated among the servers based, at least in part, on (1) estimates of the amount of time used by each of the fitted models to generate a prediction, and/or (2) estimates of the frequency with which prediction requests for the fitted models are received. In some embodiments, each prediction request is assigned to a respective thread, each prediction request has an associated latency-sensitivity value, and the number of threads executing on a particular server is determined based, at least in part, on the latency-sensitivity values of the threads executing on that server.
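
One way to picture allocation based on estimated prediction time and request frequency is the sketch below. The description does not specify an allocation algorithm; the greedy "least-loaded server" heuristic used here, the load estimate (seconds per prediction multiplied by requests per second), and all names are assumptions introduced for illustration.

```python
def assign_models(estimates, n_servers):
    """Greedily place each fitted model on the currently least-loaded server.

    estimates: dict mapping a model name to a pair
        (estimated seconds per prediction, estimated requests per second).
    Returns (placement, loads), where placement maps model name to a server
    index and loads holds the estimated busy fraction of each server.
    """
    if n_servers < 1:
        raise ValueError("at least one server is required")
    loads = [0.0] * n_servers
    placement = {}
    # Place the most demanding models first (illustrative heuristic).
    by_demand = sorted(estimates.items(),
                       key=lambda item: item[1][0] * item[1][1],
                       reverse=True)
    for name, (seconds_per_prediction, requests_per_second) in by_demand:
        server = min(range(n_servers), key=loads.__getitem__)
        placement[name] = server
        loads[server] += seconds_per_prediction * requests_per_second
    return placement, loads
```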

In some embodiments, the actions of the method further include: determining a value of a metric representing the strength of interaction between two or more features included in the time-series data; and, if the value of the metric exceeds a threshold, generating time-series values of a new feature based on the values of the two or more features, and adding the new feature to the time-series data.
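
The threshold test described above can be sketched as follows. Because the description does not define the interaction-strength metric or how the new feature is derived, both are passed in as caller-supplied functions; the column-oriented data layout, the default product combination, and the naming of the new feature are likewise assumptions made for illustration.

```python
def add_interaction_feature(columns, name_a, name_b, strength, threshold,
                            combine=lambda a, b: a * b):
    """Add a derived feature when two features interact strongly enough.

    columns: dict mapping a feature name to its time-ordered list of values.
    strength: caller-supplied metric taking the two value lists
              (hypothetical; no particular metric is assumed here).
    Returns a new dict; the input dict is left unchanged.
    """
    values_a, values_b = columns[name_a], columns[name_b]
    if strength(values_a, values_b) <= threshold:
        return dict(columns)  # metric does not exceed the threshold
    result = dict(columns)
    # Generate the time-series values of the new feature, one per time step.
    result[f"{name_a}*{name_b}"] = [combine(a, b) for a, b in zip(values_a, values_b)]
    return result
```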

In some embodiments, the actions of the method further include determining a time resolution of the time-series data. In some embodiments, the targets are identified based on user input.

Another embodiment of this aspect includes a predictive modeling apparatus, the apparatus comprising: a memory configured to store a machine-executable module encoding a predictive modeling procedure, wherein the predictive modeling procedure includes a plurality of tasks including at least one pre-processing task and at least one model-fitting task; and at least one processor configured to execute the machine-executable module, wherein executing the machine-executable module causes the apparatus to perform the predictive modeling procedure. Performing the pre-processing task includes: (a) obtaining time-series data including one or more data sets, wherein each data set includes a plurality of observations, and each observation includes (1) an indication of a time associated with the observation and (2) respective values of one or more variables; (b) determining a time interval of the time-series data; (c) identifying one or more of the variables as targets, and identifying zero or more other variables as features; and (d) determining a forecast range and a skip range associated with a prediction problem represented by the time-series data, wherein the forecast range indicates the duration of the period for which values of the targets are to be predicted, and the skip range indicates the time lag between the time associated with the earliest prediction in the forecast range and the time associated with the most recent observation on which the predictions in the forecast range are based. Performing the predictive modeling procedure includes performing the model-fitting task, which includes: (e) generating training data from the time-series data, wherein the training data include a first subset of the observations of at least one of the data sets, the first subset of the observations includes training-input and training-output collections of the observations, the times associated with the observations in the training-input and training-output collections correspond to a training-input time range and a training-output time range, respectively, the skip range separates the end of the training-input time range from the beginning of the training-output time range, and the duration of the training-output time range is at least as long as the forecast range; (f) generating testing data from the time-series data, wherein the testing data include a second subset of the observations of at least one of the data sets, the second subset of the observations includes testing-input and testing-validation collections of the observations, the times associated with the observations in the testing-input and testing-validation collections correspond to a testing-input time range and a testing-validation time range, respectively, the skip range separates the end of the testing-input time range from the beginning of the testing-validation time range, and the duration of the testing-validation time range is at least as long as the forecast range; (g) fitting a predictive model to the training data; and (h) testing the fitted model on the testing data. The machine-executable module may include a directed graph representing dependencies among the tasks.
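
The relationship among the input time range, the skip range, and the forecast range in steps (e) and (f) can be illustrated with a small sketch. It assumes integer time steps with one observation per step, a single target variable stored in a dict keyed by time, and an output range exactly as long as the forecast range; the function name and parameters are hypothetical.

```python
def input_output_collections(series, last_input_time, input_duration,
                             skip_range, forecast_range):
    """Split a series into input and output collections.

    series: dict mapping an integer time to the observed target value.
    skip_range: lag (in time steps, at least 1) between the most recent
        input observation and the earliest predicted time.
    Returns (inputs, outputs) as lists of values in time order.
    """
    if skip_range < 1 or forecast_range < 1 or input_duration < 1:
        raise ValueError("durations must be at least one time step")
    first_input_time = last_input_time - input_duration + 1
    first_output_time = last_input_time + skip_range
    input_times = range(first_input_time, last_input_time + 1)
    output_times = range(first_output_time, first_output_time + forecast_range)
    inputs = [series[t] for t in input_times if t in series]
    outputs = [series[t] for t in output_times if t in series]
    return inputs, outputs
```

The same split could be applied once to build the training-input and training-output collections and again, over a later window, to build the testing-input and testing-validation collections.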

Determining the Predictive Value of a Feature

Even when an accurate predictive model for a prediction problem is available, it can be difficult to understand (1) the prediction problem itself and (2) how a particular predictive model produces accurate prediction results. Metrics of "feature importance" can facilitate such understanding. In general, a feature importance metric indicates the predictive value of a feature of a data set for predicting the outcomes of the prediction problem represented by the data set. Conventional techniques for measuring feature importance are generally applicable only to particular types of predictive models, and are generally not well suited for use with other types of predictive models. Thus, there is a need for tools that can measure feature importance for an arbitrary predictive model or for a diverse set of predictive models. In addition, there is a need for tools that can use feature importance metrics to guide the allocation of resources to the evaluation of predictive modeling procedures, to feature engineering tasks, and to the blending of predictive models, thereby facilitating cost-effective evaluation of the space of potential predictive modeling techniques for prediction problems.

In general, another innovative aspect of the subject matter described in this specification can be embodied in a method comprising: (a) performing a plurality of predictive modeling procedures, wherein each of the predictive modeling procedures is associated with a predictive model, and performing each modeling procedure comprises fitting the associated predictive model to an initial data set representing an initial prediction problem; (b) determining a first respective accuracy score of each of the fitted predictive models, wherein the first accuracy score of each fitted model represents the accuracy with which the fitted model predicts one or more outcomes of the initial prediction problem; (c) shuffling the values of a feature across respective observations included in the initial data set, thereby generating a modified data set representing a modified prediction problem; (d) determining a second respective accuracy score of each of the fitted predictive models, wherein the second accuracy score of each fitted model represents the accuracy with which the fitted model predicts one or more outcomes of the modified prediction problem; and (e) determining a respective model-specific predictive value of the feature for each of the fitted models, wherein the model-specific predictive value of the feature for each fitted model is based on the first accuracy score and the second accuracy score of the fitted model.
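
Steps (b) through (e) can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the described implementation: fitted models are represented as plain callables that map a row (a list of feature values) to a prediction, accuracy is the fraction of exact matches, and the model-specific predictive value is taken to be the simple difference between the two accuracy scores. All function names are hypothetical.

```python
import random


def accuracy(model, rows, targets):
    """Fraction of rows for which the model's prediction equals the target."""
    hits = sum(1 for row, target in zip(rows, targets) if model(row) == target)
    return hits / len(rows)


def shuffle_feature(rows, feature_index, rng):
    """Step (c): copy the rows with one feature's values shuffled across rows."""
    column = [row[feature_index] for row in rows]
    rng.shuffle(column)
    return [row[:feature_index] + [value] + row[feature_index + 1:]
            for row, value in zip(rows, column)]


def model_specific_values(models, rows, targets, feature_index, rng=None):
    """Steps (b), (d), (e): score each fitted model before and after shuffling.

    models: dict mapping a model name to an already fitted callable.
    Returns a dict mapping each model name to (first score - second score).
    """
    if rng is None:
        rng = random.Random(0)  # fixed seed so repeated runs agree
    modified_rows = shuffle_feature(rows, feature_index, rng)
    values = {}
    for name, model in models.items():
        first_score = accuracy(model, rows, targets)
        second_score = accuracy(model, modified_rows, targets)
        values[name] = first_score - second_score
    return values
```

Because every fitted model is scored on the same modified data set, the resulting values are directly comparable across models. A feature that a model ignores yields a value of zero for that model.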

Other embodiments of this aspect include corresponding computer systems, apparatus, and computer programs recorded on one or more computer storage devices, each configured to perform the actions of the method. A system of one or more computers can be configured to perform particular actions by virtue of having software, firmware, hardware, or a combination of them installed on the system that in operation causes the system to perform the actions. One or more computer programs can be configured to perform particular actions by virtue of including instructions that, when executed by a data processing apparatus, cause the apparatus to perform the actions.

The foregoing and other embodiments can each optionally include one or more of the following features, alone or in combination. In some embodiments, the actions of the method further include, before performing the plurality of predictive modeling procedures, selecting the plurality of predictive modeling procedures for the prediction problem based on characteristics of the initial data set, characteristics of the initial prediction problem, and/or characteristics of the feature. In some embodiments, the plurality of predictive modeling procedures includes two or more modeling procedures selected from the group consisting of a random forest modeling procedure, a generalized additive modeling procedure, and a support vector machine modeling procedure. In some embodiments, the plurality of predictive modeling procedures includes a first modeling procedure selected from a first family of modeling procedures and a second modeling procedure selected from a second family of modeling procedures.

In some embodiments, the actions of the method further include, before determining the second accuracy scores of the predictive models, refitting the predictive models to the modified data set representing the modified prediction problem. In some embodiments, the determined model-specific predictive value of the feature for a particular fitted model increases as the difference between the first accuracy score and the second accuracy score of the particular fitted model increases. In some embodiments, the determined model-specific predictive value of the feature for a particular fitted model comprises the percentage difference between the first accuracy score and the second accuracy score of the particular fitted model, relative to the first accuracy score of the particular fitted model.
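
As a worked form of the percentage-difference embodiment, let $s_1$ and $s_2$ denote the first and second accuracy scores of a particular fitted model. These symbols are introduced here for illustration, and the expression assumes that a higher score indicates a more accurate model. Under those assumptions, the model-specific predictive value $v$ of the feature can be written as:

```latex
v = 100 \times \frac{s_1 - s_2}{s_1}, \qquad s_1 \neq 0
% Example: s_1 = 0.80 and s_2 = 0.60 give v = 100 \times 0.20 / 0.80 = 25.
```

For instance, a model whose accuracy score falls from 0.80 to 0.60 after the feature is shuffled would assign the feature a predictive value of 25 percent.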

In some embodiments, the actions of the method further include determining a model-independent predictive value of the feature based on the model-specific predictive values of the feature. In some embodiments, determining the model-independent predictive value of the feature comprises calculating a statistical measure of the center and/or spread of the model-specific predictive values of the feature. In some embodiments, determining the model-independent predictive value of the feature comprises calculating a statistical measure of the center of the model-specific predictive values, wherein the statistical measure of the center is selected from the group consisting of the mean, the median, and the mode of the model-specific predictive values. In some embodiments, determining the model-independent predictive value of the feature comprises calculating a statistical measure of the spread of the model-specific predictive values, wherein the statistical measure of the spread is selected from the group consisting of the range, the variance, and the standard deviation of the model-specific predictive values. In some embodiments, determining the model-independent predictive value of the feature comprises calculating a combination of the model-specific predictive values of the feature. In some embodiments, calculating the combination of the model-specific predictive values comprises calculating a weighted combination of the model-specific predictive values. In some embodiments, calculating the weighted combination of the model-specific predictive values comprises assigning respective weights to the model-specific predictive values, wherein the weight assigned to a particular model-specific predictive value corresponding to a particular fitted model increases as the first accuracy score of that fitted model increases.
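
The summary statistics and the weighted combination described above might be computed as in the sketch below. The weighting scheme shown, in which each weight is proportional to the model's first accuracy score, is only one scheme that satisfies the stated property that the weight increases with the first accuracy score; it is an assumption, as are the function names.

```python
from statistics import mean, median, pstdev


def center_and_spread(values):
    """Statistical measures of the center and spread of model-specific values."""
    values = list(values)
    return {
        "mean": mean(values),
        "median": median(values),
        "range": max(values) - min(values),
        "std_dev": pstdev(values),  # population standard deviation
    }


def weighted_value(specific_values, first_scores):
    """Weighted combination of model-specific predictive values.

    specific_values, first_scores: dicts keyed by model name. Each weight is
    the model's first accuracy score divided by the sum of all first scores
    (illustrative choice; scores are assumed to be positive).
    """
    total = sum(first_scores[name] for name in specific_values)
    if total <= 0:
        raise ValueError("first accuracy scores must sum to a positive number")
    return sum(value * first_scores[name] / total
               for name, value in specific_values.items())
```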

In some embodiments, the feature is a first feature, and the actions of the method include: (c1) generating a second modified data set representing a second modified prediction problem by shuffling values of a second feature across respective observations included in the initial data set; (d1) determining a respective third accuracy score of each of the fitted models, wherein the third accuracy score of each fitted model represents an accuracy with which the fitted model predicts one or more outcomes of the second modified prediction problem; and (e1) determining a respective model-specific predictive value of the second feature for each of the fitted models, wherein the model-specific predictive value of the second feature for each fitted model is based on the first accuracy score and the third accuracy score of the fitted model.

In some embodiments, the feature is a first feature, the initial data set includes the first feature and a plurality of second features, and the actions of the method further include determining model-specific predictive values of the second features of the initial data set by performing steps (c), (d), and (e) for each of the second features.
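By way of non-limiting illustration, steps (c), (d), and (e) may be repeated for every feature and every fitted model as sketched below. The text states only that a model-specific predictive value is "based on" the two accuracy scores; taking their difference is an assumption of this sketch, as are the negative mean absolute error used as the accuracy score, the two stand-in models, and all names.

```python
import random

def accuracy(model, rows, targets):
    """Negative mean absolute error, so that a higher score is better."""
    errors = [abs(model(r) - t) for r, t in zip(rows, targets)]
    return -sum(errors) / len(errors)

def shuffled_copy(rows, feature, rng):
    """Copy rows with one feature's values permuted across observations."""
    column = [r[feature] for r in rows]
    rng.shuffle(column)
    return [{**r, feature: v} for r, v in zip(rows, column)]

def predictive_values(models, rows, targets, seed=0):
    """Return {model name: {feature: model-specific predictive value}}."""
    rng = random.Random(seed)
    result = {}
    for name, model in models.items():
        base = accuracy(model, rows, targets)           # first accuracy score
        result[name] = {}
        for feature in rows[0]:
            modified = shuffled_copy(rows, feature, rng)  # modified data set
            # Assumed definition: drop in accuracy caused by the shuffle.
            result[name][feature] = base - accuracy(model, modified, targets)
    return result

# Hypothetical data: the target depends only on x1.
rows = [{"x1": float(i), "x2": float(i % 3)} for i in range(30)]
targets = [2.0 * r["x1"] for r in rows]
# Stand-ins for already fitted models.
models = {
    "uses_x1": lambda r: 2.0 * r["x1"],
    "uses_both": lambda r: 1.9 * r["x1"] + 0.1 * r["x2"],
}
scores = predictive_values(models, rows, targets)
```

A feature that a model ignores receives a value of zero for that model, because shuffling it leaves the model's predictions unchanged.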

In some embodiments, the actions of the method further include displaying, via a graphical user interface, graphical content identifying the first feature and the second features of the initial data set and the model-specific predictive values of the first feature and the second features. In some embodiments, the modeling procedures are first modeling procedures including a particular modeling procedure associated with a particular predictive model, the model-specific predictive values of the first feature and the second features include particular model-specific predictive values of the first feature and the second features that are specific to the particular predictive model, and the actions of the method further include (a1) performing a plurality of second predictive modeling procedures including the particular modeling procedure associated with the particular predictive model. In some embodiments, performing the particular predictive modeling procedure includes performing feature engineering on the initial data set based on the particular model-specific predictive values of the first feature and the second features.

In some embodiments, performing the feature engineering includes removing a particular feature from the initial data set based on the particular feature having a low model-specific predictive value. In some embodiments, the actions of the method further include determining that the model-specific predictive value of the particular feature is low based on that value being lower than a threshold value and/or based on that value falling in a specified percentile of the particular model-specific predictive values for the first feature and the second features of the initial data set.

In some embodiments, performing the feature engineering includes: generating a derived feature based on two or more particular features of the initial data set having high model-specific predictive values; and generating a second initial data set by adding the derived feature to the initial data set. In some embodiments, the actions of the method further include determining that the model-specific predictive values of the particular features are high based on those values being higher than a threshold value and/or based on those values falling in a specified percentile of the particular model-specific predictive values for the first feature and the second features of the initial data set.
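By way of non-limiting illustration, threshold-based feature engineering of the kind described above may be sketched as follows. The thresholds, the feature names, and the use of a pairwise product as the derived feature are assumptions of this sketch; the text does not specify how a derived feature is computed.

```python
def engineer_features(rows, values, low=0.01, high=0.10):
    """Drop low-value features and add products of high-value feature pairs.

    rows:   list of observations, each a dict of feature name -> number
    values: model-specific predictive value of each feature
    low, high: hypothetical thresholds for "low" and "high" predictive value
    """
    drop = {f for f, v in values.items() if v < low}
    strong = sorted(f for f, v in values.items() if v > high)
    new_rows = []
    for row in rows:
        # Remove features whose predictive value is below the low threshold.
        new = {f: x for f, x in row.items() if f not in drop}
        # Assumed derivation: product of each pair of high-value features.
        for i, a in enumerate(strong):
            for b in strong[i + 1:]:
                new[f"{a}*{b}"] = row[a] * row[b]
        new_rows.append(new)
    return new_rows

rows = [{"age": 30.0, "income": 4.0, "noise": 0.7}]
values = {"age": 0.25, "income": 0.40, "noise": 0.001}
second_data_set = engineer_features(rows, values)
```

The returned rows play the role of the second initial data set, to which a predictive model can then be refitted.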

In some embodiments, performing the particular predictive modeling procedure further includes fitting the particular predictive model to the second initial data set, and the actions of the method further include: determining a first accuracy score of the fitted particular predictive model, wherein the first accuracy score of the fitted particular model represents an accuracy with which the fitted particular model predicts one or more outcomes of the initial prediction problem; generating a second modified data set representing a second modified prediction problem by shuffling values of the first feature across respective observations included in the second initial data set; determining a second accuracy score of the fitted particular predictive model, wherein the second accuracy score of the fitted particular model represents an accuracy with which the fitted particular model predicts one or more outcomes of the second modified prediction problem; and determining a second model-specific predictive value of the first feature for the fitted particular model, wherein the second model-specific predictive value of the first feature for the fitted particular model is based on the first accuracy score and the second accuracy score of the fitted particular model.

In some embodiments, the actions of the method further include, before performing the plurality of second modeling procedures, selecting the second modeling procedures based on suitabilities of the selected modeling procedures for the initial prediction problem, wherein the suitability of the particular predictive modeling procedure for the initial prediction problem is determined based, at least in part, on characteristics of one or more particular features of the initial data set having high model-specific predictive values for the particular predictive modeling procedure.

In some embodiments, the actions of the method further include: transmitting instructions to a plurality of processing nodes, the instructions including a resource allocation schedule allocating resources of the processing nodes for execution of the second modeling procedures, the resource allocation schedule being based, at least in part, on the suitabilities of the second modeling procedures for the initial prediction problem; receiving results of the execution of the second modeling procedures by the plurality of processing nodes in accordance with the resource allocation schedule, wherein the results include predictive models generated by the second modeling procedures and/or scores of the generated models for data associated with the initial prediction problem; and selecting, from the generated models, a predictive model for the initial prediction problem based, at least in part, on the score of the selected predictive model. In some embodiments, the actions of the method further include: generating a blended predictive model by combining two or more of the generated predictive models; and evaluating the blended predictive model.
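By way of non-limiting illustration, a resource allocation schedule based on suitability may be sketched as follows. The text does not state how suitability is mapped to resources; the proportional split, the unit of node-hours, and the procedure names below are all assumptions of this sketch.

```python
def allocation_schedule(suitability, total_node_hours):
    """Split a node-hour budget across procedures in proportion to suitability.

    suitability:      modeling procedure name -> suitability score (positive)
    total_node_hours: hypothetical total budget across the processing nodes
    """
    total = sum(suitability.values())
    return {name: total_node_hours * score / total
            for name, score in suitability.items()}

# Hypothetical suitability scores for three second modeling procedures.
suitability = {"gradient_boosting": 0.6, "linear_model": 0.3, "knn": 0.1}
schedule = allocation_schedule(suitability, 100.0)
```

Under this assumed rule, a procedure judged more suitable for the prediction problem receives a larger share of the budget, and the shares always sum to the total.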

In some embodiments, the model-independent predictive value of the feature for the modified prediction problem is less than a threshold predictive value.

In some embodiments, the initial data set is an initial time-series data set, the initial prediction problem is an initial time-series prediction problem, the modified data set is a modified time-series data set, and the modified prediction problem is a modified time-series prediction problem. In some embodiments, the fitted models include one or more time-series predictive models.

Another embodiment of this aspect includes a predictive modeling apparatus including: a memory configured to store processor-executable instructions; and a processor configured to execute the processor-executable instructions, wherein executing the processor-executable instructions causes the apparatus to perform steps including: (a) performing a plurality of predictive modeling procedures, wherein each of the predictive modeling procedures is associated with a predictive model, and wherein performing each modeling procedure includes fitting the associated predictive model to an initial data set representing an initial prediction problem; (b) determining a respective first accuracy score of each of the fitted predictive models, wherein the first accuracy score of each fitted model represents an accuracy with which the fitted model predicts one or more outcomes of the initial prediction problem; (c) generating a modified data set representing a modified prediction problem by shuffling values of a feature across respective observations included in the initial data set; (d) determining a respective second accuracy score of each of the fitted predictive models, wherein the second accuracy score of each fitted model represents an accuracy with which the fitted model predicts one or more outcomes of the modified prediction problem; and (e) determining a respective model-specific predictive value of the feature for each of the fitted models, wherein the model-specific predictive value of the feature for each fitted model is based on the first accuracy score and the second accuracy score of the fitted model.

Another embodiment of this aspect includes an article of manufacture having computer-readable instructions stored thereon that, when executed by a processor, cause the processor to perform operations including: (a) performing a plurality of predictive modeling procedures, wherein each of the predictive modeling procedures is associated with a predictive model, and wherein performing each modeling procedure includes fitting the associated predictive model to an initial data set representing an initial prediction problem; (b) determining a respective first accuracy score of each of the fitted predictive models, wherein the first accuracy score of each fitted model represents an accuracy with which the fitted model predicts one or more outcomes of the initial prediction problem; (c) generating a modified data set representing a modified prediction problem by shuffling values of a feature across respective observations included in the initial data set; (d) determining a respective second accuracy score of each of the fitted predictive models, wherein the second accuracy score of each fitted model represents an accuracy with which the fitted model predicts one or more outcomes of the modified prediction problem; and (e) determining a respective model-specific predictive value of the feature for each of the fitted models, wherein the model-specific predictive value of the feature for each fitted model is based on the first accuracy score and the second accuracy score of the fitted model.

Particular embodiments of this aspect can be implemented so as to realize one or more of the following advantages. Some embodiments of this aspect may advantageously be used to facilitate understanding of a prediction problem and to indicate how a particular predictive model arrives at accurate predictions of outcomes. Some embodiments of this aspect can measure feature importance for an arbitrary predictive model or for a diverse set of predictive models. Some embodiments of this aspect can guide the allocation of resources to evaluating predictive modeling procedures, to feature engineering tasks, and to blending predictive models, thereby facilitating cost-effective evaluation of the space of potential predictive modeling techniques for a prediction problem.

Second-Order Predictive Modeling

Some modeling techniques tend to produce opaque and/or complex models that are difficult to understand and difficult to implement efficiently in software. Software implementing such models may use significant computing resources to produce predictions that could be generated far more efficiently by software implementing other, equally accurate models.

There is a need for techniques that reduce the opacity and/or complexity of a first-order predictive model (M1), which predicts values of one or more output variables ("targets") (T) based on values of one or more input variables ("features") (F1), without significantly reducing the accuracy of the model. The inventors have recognized and appreciated that this need can be met by building a second-order model (M2) of the first-order model (M1). The second-order model may predict the first-order model's predicted values for the targets (T) based on the same features (F1) (or a subset thereof) and/or on one or more features not used by the first-order model.

In many cases, second-order models generated by embodiments of this aspect are as accurate as, or more accurate than, the corresponding first-order models, and software implementing the second-order models is substantially more efficient than software implementing the corresponding first-order models. When embodiments of this aspect are used to generate a second-order model of a blended first-order model, the second-order model is in many cases particularly accurate, and software implementing the second-order model is in many cases more efficient (e.g., uses fewer computational resources) than software implementing the corresponding first-order model. Second-order models generated in accordance with some embodiments of this aspect may be useful for understanding complex first-order models and/or may simplify the task of generating software that implements accurate predictive models.

In general, another innovative aspect of the subject matter described in this specification can be embodied in a predictive modeling method including: obtaining a fitted first-order predictive model, wherein the first-order predictive model is configured to predict values of one or more output variables of a prediction problem based on values of one or more first input variables; and performing a second-order predictive modeling procedure on the fitted first-order model, wherein the second-order modeling procedure is associated with a second-order predictive model, and wherein performing the second-order predictive modeling procedure on the fitted first-order model includes: generating second-order input data including a plurality of second-order observations, wherein each second-order observation includes respective observed values of one or more second input variables and predicted values of the output variables, and wherein generating the second-order input data includes, for each second-order observation, obtaining the respective observed values of the second input variables and corresponding observed values of the first input variables, and applying the first-order predictive model to the corresponding observed values of the first input variables to generate the respective predicted values of the output variables; generating, from the second-order input data, second-order training data and second-order testing data; generating a fitted second-order predictive model of the fitted first-order model by fitting the second-order predictive model to the second-order training data; and testing the fitted second-order predictive model of the fitted first-order model on the second-order testing data.
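By way of non-limiting illustration, the second-order modeling procedure described above may be sketched as follows. The stand-in first-order model (a blend of two linear models), the single input variable, the 80/20 split, and the use of a least-squares line as the second-order model are assumptions of this sketch and are not the models named elsewhere in this specification.

```python
import random

def first_order_model(x):
    """Stand-in for a fitted first-order model: a blend of two linear models."""
    return 0.5 * (3.0 * x + 1.0) + 0.5 * (3.0 * x + 1.2)

def fit_line(xs, ys):
    """Ordinary least squares for y = a * x + b (the second-order model here)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    a = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    return a, mean_y - a * mean_x

rng = random.Random(0)
observed = [rng.uniform(0.0, 10.0) for _ in range(100)]

# Second-order input data: observed inputs paired with the first-order
# model's predictions, which take the place of observed target values.
second_order_data = [(x, first_order_model(x)) for x in observed]

# Second-order training and testing data.
train, test = second_order_data[:80], second_order_data[80:]

# Fit the second-order model, then test it on the held-back observations.
a, b = fit_line([x for x, _ in train], [y for _, y in train])
test_error = max(abs(a * x + b - y) for x, y in test)
```

In this contrived case the blend is itself linear, so the second-order model reproduces it almost exactly while needing only one multiplication and one addition per prediction.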

The foregoing and other embodiments can each optionally include one or more of the following features, alone or in combination. In some embodiments, obtaining the fitted first-order model includes performing a first-order predictive modeling procedure associated with the first-order predictive model, wherein performing the first-order predictive modeling procedure includes: obtaining first-order input data including a plurality of first-order observations, wherein each first-order observation includes respective observed values of the first input variables and corresponding observed values of the output variables; generating, from the first-order input data, first-order training data and first-order testing data; fitting the first-order predictive model to the first-order training data; and testing the fitted first-order predictive model on the first-order testing data. In some embodiments, obtaining the fitted first-order model includes blending two fitted predictive models.

In some embodiments, obtaining the fitted first-order model includes: determining suitabilities of a plurality of first-order predictive modeling procedures for the prediction problem based, at least in part, on characteristics of the prediction problem and/or on attributes of the respective first-order predictive modeling procedures; selecting one or more predictive modeling procedures from the plurality of first-order predictive modeling procedures based on the determined suitabilities of the selected modeling procedures for the prediction problem; and performing the one or more predictive modeling procedures. In some embodiments, performing the one or more predictive modeling procedures includes: transmitting instructions to a plurality of processing nodes, the instructions including a resource allocation schedule allocating resources of the processing nodes for execution of the selected modeling procedures, the resource allocation schedule being based, at least in part, on the suitabilities of the selected modeling procedures for the prediction problem; receiving results of the execution of the selected modeling procedures by the plurality of processing nodes in accordance with the resource allocation schedule, wherein the results include predictive models generated by the selected modeling procedures; and selecting the fitted first-order model from the generated models.

In some embodiments, the second-order predictive model is selected from the group consisting of a RuleFit model and a generalized additive model. In some embodiments, the actions of the method further include performing cross-validation of the second-order model, wherein the second-order input data include at least one data set, generating the second-order training data includes obtaining a first subset of the data set, and generating the second-order testing data includes obtaining a second subset of the data set.

In some embodiments, the second-order training data are first second-order training data, the second-order testing data are first second-order testing data, the fitted second-order model is a first fitted second-order model, and performing the cross-validation of the second-order model includes: (a) generating second second-order training data and second second-order testing data from the second-order input data, wherein the second second-order training data include a third subset of the data set and the second second-order testing data include a fourth subset of the data set; (b) fitting the second-order predictive model to the second second-order training data to obtain a second fitted second-order predictive model; and (c) testing the second fitted second-order predictive model on the second second-order testing data.

In some embodiments, the actions of the method further include partitioning the data set into a plurality of partitions including at least a first partition and a second partition. In some embodiments, partitioning the data set into the plurality of partitions includes randomly assigning each observation of the data set to a respective partition. In some embodiments, the first second-order training data include the first partition of the data set; the first second-order testing data include all partitions of the data set other than the first partition; the second second-order training data include the second partition of the data set; and the second second-order testing data include all partitions of the data set other than the second partition. In some embodiments, the first second-order training data include a subset of the first partition of the data set; the first second-order testing data include respective subsets of all partitions of the data set other than the first partition; the second second-order training data include a subset of the second partition of the data set; and the second second-order testing data include respective subsets of all partitions of the data set other than the second partition.
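By way of non-limiting illustration, the random partitioning and the resulting training and testing splits may be sketched as follows. The sketch follows the assignment stated above, in which a given partition supplies the training data and the remaining partitions supply the testing data; the function names, the number of partitions, and the fixed seed are assumptions.

```python
import random

def partition(n_observations, k, seed=0):
    """Randomly assign each observation index to one of k partitions."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for index in range(n_observations):
        folds[rng.randrange(k)].append(index)
    return folds

def splits(folds):
    """Yield one (training, testing) pair of index lists per partition.

    Partition i is the training data; all other partitions together
    form the testing data, as in the embodiment described in the text.
    """
    for i, fold in enumerate(folds):
        testing = [obs for j, other in enumerate(folds) if j != i for obs in other]
        yield fold, testing

folds = partition(20, 4)
pairs = list(splits(folds))
```

Swapping the two values yielded by `splits` gives the arrangement in which all partitions but one are used for training, should that variant be preferred.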

In some embodiments, the second-order input data include a first partition and a second partition, the data set includes the first partition of the second-order input data, and the actions of the method further include testing the first and second fitted second-order models on holdout data including the second partition of the second-order input data. In some embodiments, no predictive model is fitted to the holdout data.

In some embodiments, performing the second-order predictive modeling procedure further includes performing nested cross-validation of the second-order predictive model. In some embodiments, the second-order input data include at least one data set, and performing the nested cross-validation of the second-order predictive model includes: partitioning the data set into a first plurality of partitions including at least a first partition of the data set and a second partition of the data set; and partitioning the first partition of the data set into a plurality of partitions of the first partition, including at least a first partition of the first partition of the data set and a second partition of the first partition of the data set. In some embodiments, the second-order training data include the first partition of the first partition of the data set, and the second-order testing data include all partitions of the first partition of the data set other than the first partition of the first partition of the data set.

In some embodiments, the second-order training data are first second-order training data, the second-order testing data are first second-order testing data, the fitted second-order model is a first fitted second-order model, and performing the nested cross-validation of the second-order predictive model further includes: (a) generating, from the first partition of the data set, second second-order training data and second second-order testing data, wherein the second second-order training data include the second partition of the first partition of the data set, and the second second-order testing data include the partitions of the first partition of the data set other than the second partition of the first partition of the data set; (b) fitting the second-order predictive model to the second second-order training data to obtain a second fitted second-order predictive model; and (c) testing the second fitted second-order model on the second second-order testing data.

In some embodiments, performing the nested cross-validation further includes: testing the first fitted secondary model and the second fitted secondary model on the second partition of the data set; and comparing the first fitted secondary model with the second fitted secondary model based on results of testing the first fitted secondary model and the second fitted secondary model on the second partition of the data set.
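The nested cross-validation described above can be illustrated with a minimal sketch. It follows the wording of the text literally: each inner partition in turn serves as training data, the remaining inner partitions serve as testing data, and the second (outer) partition is held out for comparing the fitted models. The round-robin partitioning, the mean-predicting toy model, and every function name here are assumptions made for illustration, not part of the disclosure.

```python
from statistics import mean

def partition(rows, k):
    """Split rows into k disjoint partitions (round-robin assignment)."""
    return [rows[i::k] for i in range(k)]

def fit_mean_model(train_rows):
    """Toy model (assumption): always predicts the mean training target."""
    m = mean(y for _, y in train_rows)
    return lambda x: m

def mse(model, rows):
    """Mean squared error of a model on (x, y) rows."""
    return mean((model(x) - y) ** 2 for x, y in rows)

def nested_cv(rows, inner_k=3):
    # Outer split: first partition (used for fitting) and second partition (holdout).
    first, holdout = partition(rows, 2)
    # Inner split of the first partition.
    inner = partition(first, inner_k)
    results = []
    for i, train in enumerate(inner):
        # Testing data: every inner partition except the i-th.
        test = [r for j, part in enumerate(inner) if j != i for r in part]
        model = fit_mean_model(train)
        results.append({
            "fold": i,
            "inner_mse": mse(model, test),
            "holdout_mse": mse(model, holdout),
        })
    return results

rows = [(i, float(i)) for i in range(12)]
results = nested_cv(rows)
# Compare the fitted models on the held-out second partition.
best = min(results, key=lambda r: r["holdout_mse"])
```

Comparing the fitted models on the outer holdout, rather than on the inner testing data, keeps the comparison independent of the data used to fit and test each model.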

In some embodiments, the actions of the method further include determining an accuracy score of each of the fitted predictive models, wherein the accuracy score of each fitted model represents the accuracy with which that fitted model predicts outcomes of one or more prediction problems. In some embodiments, the actions of the method further include determining a discrepancy between the accuracy score of the fitted primary model and the accuracy score of the fitted secondary model. In some embodiments, the accuracy score of the fitted secondary model exceeds the accuracy score of the fitted primary model.

In some embodiments, the actions of the method further include determining an amount of computational resources used by each of the fitted predictive models to predict outcomes of one or more prediction problems. According to some embodiments, the actions of the method further include determining a discrepancy between the amount of computational resources used by the fitted primary model and the amount of computational resources used by the fitted secondary model. In some embodiments, the amount of computational resources used by the fitted secondary model is less than the amount of computational resources used by the fitted primary model.
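As a hedged illustration of the two discrepancies discussed above (accuracy and computational resources), the sketch below scores a stand-in primary model and a stand-in secondary model on the same labeled data. Wall-clock time is used as a proxy for computational resources, which is an assumption; both models, the data, and the function names are hypothetical.

```python
import time

def evaluate(model, rows):
    """Return (accuracy, elapsed seconds) for a model on labeled (x, y) rows."""
    start = time.perf_counter()
    preds = [model(x) for x, _ in rows]
    elapsed = time.perf_counter() - start
    correct = sum(p == y for p, (_, y) in zip(preds, rows))
    return correct / len(rows), elapsed

def primary(x):
    """Stand-in primary model: slightly wrong and deliberately expensive."""
    _ = sum(i * i for i in range(2000))  # simulated heavy computation
    return x >= 5

def secondary(x):
    """Stand-in secondary model: a single cheap rule."""
    return x > 5

rows = [(x, x > 5) for x in range(10)]
acc_primary, time_primary = evaluate(primary, rows)
acc_secondary, time_secondary = evaluate(secondary, rows)

accuracy_discrepancy = acc_secondary - acc_primary
resource_discrepancy = time_primary - time_secondary
```

In practice a resource measurement would likely also cover memory and would be averaged over repeated runs; a single timing is shown only to keep the sketch short.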

In some embodiments, the actions of the method further include deploying the fitted secondary model. In some embodiments, deploying the fitted secondary model includes generating a plurality of predictions by applying the fitted secondary model to other data representing instances of the prediction problem, wherein the secondary input data do not include the other data. In some embodiments, the fitted secondary model includes a set of one or more conditional rules, and the set of one or more conditional rules includes a set of one or more machine-executable if-then statements.
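A fitted secondary model expressed as machine-executable if-then statements might look like the following sketch. The feature names (`age`, `prior_claims`), the thresholds, and the class labels are invented for illustration and do not come from the disclosure.

```python
def rule_model(obs):
    """Hypothetical secondary model written as ordered if-then rules."""
    if obs["age"] < 25 and obs["prior_claims"] > 0:
        return "high_risk"
    if obs["prior_claims"] > 2:
        return "high_risk"
    return "low_risk"

# "Other data": instances that were not part of the secondary input data.
new_data = [
    {"age": 22, "prior_claims": 1},
    {"age": 40, "prior_claims": 0},
    {"age": 55, "prior_claims": 3},
]

# Deployment here simply means applying the rules to the new instances.
predictions = [rule_model(obs) for obs in new_data]
```

Because the rules are plain conditionals, such a model can be inspected directly and ported to other execution environments with little effort.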

In some embodiments, the secondary input data are first secondary input data, and deploying the fitted secondary model further includes refreshing the fitted secondary model based, at least in part, on second secondary input data. In some embodiments, the fitted secondary model is a first fitted secondary model, and refreshing the fitted secondary model based, at least in part, on the second secondary input data includes: generating, from the second secondary input data, second secondary training data and second secondary testing data; generating a second fitted secondary model of the fitted primary model by fitting the secondary predictive model to the second secondary training data; testing the second fitted secondary model of the primary model on the second secondary testing data; and blending the first fitted secondary model and the second fitted secondary model to generate a refreshed secondary predictive model. In some embodiments, the fitted secondary model is a first fitted secondary model, and refreshing the fitted secondary model based, at least in part, on the second secondary input data includes: generating third secondary input data including at least a portion of the first secondary input data and at least a portion of the second secondary input data; generating, from the third secondary input data, third secondary training data and third secondary testing data; generating a second fitted secondary model of the fitted primary model by fitting the secondary predictive model to the third secondary training data; and testing the second fitted secondary model of the primary model on the third secondary testing data.
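The first refresh variant described above (fit a second model on newer data, then blend it with the first) can be sketched as follows. The text does not specify how the models are blended; a weighted average of predictions is assumed here, and the slope-through-origin model and all names are hypothetical.

```python
def fit_slope(rows):
    """Toy secondary model (assumption): least-squares slope through the origin."""
    slope = sum(x * y for x, y in rows) / sum(x * x for x, _ in rows)
    return lambda x: slope * x

def blend(model_a, model_b, weight_a=0.5):
    """Blend two fitted models by averaging their predictions (one possible choice)."""
    return lambda x: weight_a * model_a(x) + (1.0 - weight_a) * model_b(x)

# First secondary input data and the first fitted secondary model.
first_data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
first_model = fit_slope(first_data)

# Second secondary input data, arriving after deployment.
second_data = [(1.0, 4.0), (2.0, 8.0), (3.0, 12.0)]
second_model = fit_slope(second_data)

# Refreshed model: a blend of the first and second fitted secondary models.
refreshed = blend(first_model, second_model)
```

The second variant in the text would instead pool portions of `first_data` and `second_data` into a third data set and fit a single new model on it.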

In some embodiments, the first input variables are the second input variables. In some embodiments, the first input variables and the second input variables both include a particular input variable. In some embodiments, none of the first input variables is included in the second input variables.

In some embodiments, the secondary modeling procedure is one of a plurality of secondary modeling procedures, the secondary predictive model is one of a plurality of secondary predictive models, and the actions of the method further include generating a plurality of fitted secondary models of the fitted primary model by performing the plurality of secondary modeling procedures on the fitted primary model. In some embodiments, the actions of the method further include determining an accuracy score of each of the fitted secondary models, wherein the accuracy score of each fitted secondary model represents the accuracy with which that fitted secondary model predicts outcomes of one or more prediction problems. In some embodiments, the actions of the method further include: determining which accuracy score is highest; and deploying the fitted secondary model having the highest accuracy score.
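Selecting among several fitted secondary models by accuracy score reduces to scoring each candidate and taking the maximum. The sketch below assumes three hypothetical candidate models and a simple fraction-correct accuracy measure; none of these names come from the disclosure.

```python
def accuracy(model, rows):
    """Fraction of labeled (x, y) rows the model predicts correctly."""
    return sum(model(x) == y for x, y in rows) / len(rows)

# Hypothetical fitted secondary models produced by different modeling procedures.
candidates = {
    "one_threshold_rule": lambda x: x > 3,
    "two_threshold_rule": lambda x: 3 < x < 8,
    "constant": lambda x: True,
}

rows = [(x, 3 < x < 8) for x in range(10)]
scores = {name: accuracy(model, rows) for name, model in candidates.items()}

# The candidate with the highest accuracy score is the one to deploy.
best_name = max(scores, key=scores.get)
model_to_deploy = candidates[best_name]
```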

Another embodiment of this aspect includes a predictive modeling apparatus including: a memory configured to store a machine-executable module encoding a secondary predictive modeling procedure associated with a secondary predictive model, wherein the secondary predictive modeling procedure includes a plurality of tasks including at least one pre-processing task and at least one model-fitting task; and at least one processor configured to execute the machine-executable module, wherein executing the machine-executable module causes the apparatus to perform the secondary predictive modeling procedure on a fitted primary predictive model. Performing the secondary predictive modeling procedure includes performing the pre-processing task, which includes obtaining the fitted primary predictive model, wherein the primary predictive model is configured to predict values of one or more output variables of a prediction problem based on values of one or more first input variables. Performing the secondary predictive modeling procedure further includes performing the model-fitting task, which includes: generating secondary input data including a plurality of secondary observations, wherein each secondary observation includes respective observed values of one or more second input variables and predicted values of the output variables, and wherein generating the secondary input data includes, for each secondary observation, obtaining the respective observed values of the second input variables and corresponding observed values of the first input variables, and applying the primary predictive model to the corresponding observed values of the first input variables to generate the respective predicted values of the output variables; generating, from the secondary input data, secondary training data and secondary testing data; generating a fitted secondary predictive model of the fitted primary model by fitting the secondary predictive model to the secondary training data; and testing the fitted secondary predictive model of the fitted primary model on the secondary testing data.
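The model-fitting task described above (label observations with the primary model's predictions, then fit and test a secondary model on those labels) can be sketched as follows. The primary model is a stand-in scoring rule over two first input variables (`x1`, `x2`); the secondary model uses a single second input variable (`x1`) and a one-threshold rule. All of these choices, and every name in the code, are assumptions made for illustration.

```python
import random

def primary_model(obs):
    """Stand-in for a fitted primary model (hypothetical scoring rule)."""
    return 1 if 0.8 * obs["x1"] + 0.2 * obs["x2"] > 0.5 else 0

def make_secondary_data(observations, primary):
    """Pair each observation with the primary model's predicted output."""
    return [(obs, primary(obs)) for obs in observations]

def fit_threshold_rule(train):
    """Secondary model: one if-then rule on x1, with the threshold chosen
    to agree as often as possible with the primary model's predictions."""
    def agreement(t):
        return sum((obs["x1"] > t) == bool(label) for obs, label in train)
    candidates = sorted(obs["x1"] for obs, _ in train)
    best_t = max(candidates, key=agreement)
    return lambda obs: 1 if obs["x1"] > best_t else 0

def agreement_rate(model, rows):
    """Fraction of rows where the model matches the primary model's prediction."""
    return sum(model(obs) == label for obs, label in rows) / len(rows)

rng = random.Random(0)
observations = [{"x1": rng.random(), "x2": rng.random()} for _ in range(200)]

secondary_data = make_secondary_data(observations, primary_model)
train, test = secondary_data[:150], secondary_data[150:]   # training / testing split
secondary_model = fit_threshold_rule(train)
fidelity = agreement_rate(secondary_model, test)
```

Note that the secondary model is trained on the primary model's predictions rather than on observed outcomes, so `fidelity` measures agreement with the primary model, not accuracy against ground truth.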

Other aspects and advantages of the invention will become apparent from the following drawings, detailed description, and claims, all of which illustrate the principles of the invention by way of example only.

The foregoing summary, including the description of some embodiments, their motivations, and/or their advantages, is intended to assist the reader in understanding the present invention and does not in any way limit the scope of the claims.

Certain advantages of some embodiments may be understood by referring to the following description taken in conjunction with the accompanying drawings. In the drawings, like reference characters generally refer to the same parts throughout the different views. Also, the drawings are not necessarily to scale, emphasis instead generally being placed upon illustrating the principles of some embodiments of the invention.
FIG. 1 is a block diagram of a predictive modeling system, in accordance with some embodiments.
FIG. 2 is a block diagram of a modeling tool for building machine-executable templates encoding predictive modeling tasks, techniques, and methodologies, in accordance with some embodiments.
FIG. 3 is a flowchart of a method for selecting a predictive model for a prediction problem, in accordance with some embodiments.
FIG. 4 shows another flowchart of a method for selecting a predictive model for a prediction problem, in accordance with some embodiments.
FIG. 5 is a schematic diagram of a predictive modeling system, in accordance with some embodiments.
FIG. 6 is another block diagram of a predictive modeling system, in accordance with some embodiments.
FIG. 7 illustrates communication among components of a predictive modeling system, in accordance with some embodiments.
FIG. 8 is another schematic diagram of a predictive modeling system, in accordance with some embodiments.
FIG. 9 is a flowchart of a method for time-series predictive modeling, in accordance with some embodiments.
FIG. 10 is a flowchart of a method for determining the predictive value of a feature, in accordance with some embodiments.
FIG. 11A is a flowchart of a method for generating a secondary predictive model, in accordance with some embodiments.
FIG. 11B is a flowchart of a method for performing a secondary predictive modeling procedure, in accordance with some embodiments.

Overview of Predictive Modeling System

Referring to FIG. 1, in some embodiments, a predictive modeling system 100 includes a predictive modeling exploration engine 110, a user interface 120, a library 130 of predictive modeling techniques, and a predictive model deployment engine 140. The exploration engine 110 may implement a search technique (or "modeling methodology") for efficiently exploring the predictive modeling search space (e.g., potential combinations of pre-processing steps, modeling algorithms, and post-processing steps) to generate a predictive modeling solution suitable for a specified prediction problem. The search technique may include an initial evaluation of which predictive modeling techniques are likely to provide suitable solutions for the prediction problem. In some embodiments, the search technique includes an incremental evaluation of the search space (e.g., using increasing fractions of a data set) and a consistent comparison of the suitability of different modeling solutions for the prediction problem (e.g., using consistent metrics). In some embodiments, the search technique adapts based on the results of prior searches, which can improve the effectiveness of the search technique over time.

The exploration engine 110 may use the library 130 of modeling techniques to evaluate potential modeling solutions in the search space. In some embodiments, the modeling technique library 130 includes machine-executable templates encoding complete modeling techniques. A machine-executable template may include one or more predictive modeling algorithms. In some embodiments, the modeling algorithms included in a template may be related in some way. For example, the modeling algorithms may be variants of the same modeling algorithm or members of a family of modeling algorithms. In some embodiments, a machine-executable template further includes one or more pre-processing and/or post-processing steps suitable for use with the template's algorithm(s). The algorithm(s), pre-processing steps, and/or post-processing steps may be parameterized. A machine-executable template may be applied to a user data set to generate potential predictive modeling solutions for the prediction problem represented by the data set.

The exploration engine 110 may use the computational resources of a distributed computing system to explore the search space or portions thereof. In some embodiments, the exploration engine 110 generates a search plan for efficiently executing the search using the resources of the distributed computing system, and the distributed computing system executes the search in accordance with the search plan. The distributed computing system may provide interfaces that facilitate the evaluation of predictive modeling solutions in accordance with the search plan, including, without limitation, interfaces for queuing and monitoring of predictive modeling techniques, for virtualization of the computing system's resources, for accessing databases, for partitioning the search plan and allocating the computing system's resources to evaluation of modeling techniques, for collecting and organizing execution results, for accepting user input, etc.

The user interface 120 provides tools for monitoring and/or guiding the search of the predictive modeling space. These tools may provide insight into a prediction problem's data set (e.g., by highlighting problematic variables in the data set, identifying relationships between variables in the data set, etc.) and/or insight into the results of the search. In some embodiments, data analysts may use the interface to guide the search, e.g., by specifying the metrics to be used to evaluate and compare modeling solutions, by specifying the criteria for recognizing a suitable modeling solution, etc. Thus, the user interface may be used by analysts to improve their own productivity and/or to improve the performance of the exploration engine 110. In some embodiments, the user interface 120 presents the results of the search in real time and permits users to guide the search in real time (e.g., to adjust the scope of the search or the allocation of resources among the evaluations of different modeling solutions). In some embodiments, the user interface 120 provides tools for coordinating the efforts of multiple data analysts working on the same prediction problem and/or related prediction problems.

In some embodiments, the user interface 120 provides tools for developing machine-executable templates for the library 130 of modeling techniques. System users may use these tools to modify existing templates, to create new templates, or to remove templates from the library 130. In this way, system users may update the library 130 to reflect advances in predictive modeling research and/or to include proprietary predictive modeling techniques.

The model deployment engine 140 provides tools for deploying predictive models (e.g., predictive models generated by the exploration engine 110) in operational environments. In some embodiments, the model deployment engine also provides tools for monitoring and/or updating predictive models. System users may use the deployment engine 140 to deploy predictive models generated by the exploration engine 110, to monitor the performance of such predictive models, and to update such models (e.g., based on new data or advances in predictive modeling techniques). In some embodiments, the exploration engine 110 may use data collected and/or generated by the deployment engine 140 (e.g., based on results of monitoring the performance of deployed predictive models) to guide the exploration of the search space for a prediction problem (e.g., to re-fit or tune a predictive model in response to changes in the underlying data set for the prediction problem).

These and other aspects of the predictive modeling system 100 are described in further detail below.

Library of Modeling Techniques

The library 130 of predictive modeling techniques includes machine-executable templates encoding complete predictive modeling techniques. In some embodiments, a machine-executable template includes one or more predictive modeling algorithms, zero or more pre-processing steps suitable for use with the algorithm(s), and zero or more post-processing steps suitable for use with the algorithm(s). The algorithm(s), pre-processing steps, and/or post-processing steps may be parameterized. A machine-executable template may be applied to a data set to generate potential predictive modeling solutions for the prediction problem represented by the data set.

A template may encode, for machine execution, pre-processing steps, model-fitting steps, and/or post-processing steps suitable for use with the template's predictive modeling algorithm(s). Examples of pre-processing steps include, without limitation, imputing missing values, feature engineering (e.g., one-hot encoding, splines, text mining, etc.), and feature selection (e.g., dropping uninformative features, dropping highly correlated features, replacing original features with top principal components, etc.). Examples of model-fitting steps include, without limitation, algorithm selection, parameter estimation, hyper-parameter tuning, scoring, diagnostics, etc. Examples of post-processing steps include, without limitation, calibration of predictions, censoring, blending, etc.
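One way to picture a machine-executable template is as an ordered pipeline of zero or more pre-processing steps, a model-fitting step, and zero or more post-processing steps. The sketch below is a simplified illustration under that assumption; the `Template` class, its fields, and the three example steps (mean imputation, slope fitting, and clamping of predictions as a form of censoring) are hypothetical and are not the templates of the disclosure.

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Template:
    """Hypothetical template: pre-processing, model fitting, post-processing."""
    name: str
    fit: Callable
    preprocess: List[Callable] = field(default_factory=list)   # zero or more steps
    postprocess: List[Callable] = field(default_factory=list)  # zero or more steps

    def run(self, rows):
        for step in self.preprocess:
            rows = step(rows)
        model = self.fit(rows)
        for step in self.postprocess:
            model = step(model)
        return model

def impute_missing(rows):
    """Pre-processing: replace missing x values (None) with the mean observed x."""
    seen = [x for x, _ in rows if x is not None]
    fill = sum(seen) / len(seen)
    return [(fill if x is None else x, y) for x, y in rows]

def fit_slope(rows):
    """Model fitting: least-squares slope through the origin."""
    slope = sum(x * y for x, y in rows) / sum(x * x for x, _ in rows)
    return lambda x: slope * x

def censor(model, lo=0.0, hi=10.0):
    """Post-processing: clamp predictions to the range [lo, hi]."""
    return lambda x: min(hi, max(lo, model(x)))

template = Template(
    name="imputed_slope",
    fit=fit_slope,
    preprocess=[impute_missing],
    postprocess=[censor],
)
model = template.run([(1.0, 2.0), (None, 4.0), (3.0, 6.0)])
```

Keeping the steps as plain callables means each step could carry its own parameters, which corresponds loosely to the parameterization mentioned in the text.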

In some embodiments, a machine-executable template includes metadata describing attributes of the predictive modeling technique encoded by the template. The metadata may indicate one or more data processing techniques that the template can perform as part of a predictive modeling solution (e.g., in a pre-processing step, in a post-processing step, or in a step of a predictive modeling algorithm). These data processing techniques may include, without limitation, text mining, feature normalization, dimension reduction, or other suitable data processing techniques. Alternatively or in addition, the metadata may indicate one or more data processing constraints imposed by the predictive modeling technique encoded by the template, including, without limitation, constraints on the dimensionality of the data set, characteristics of the prediction problem's target(s), and/or characteristics of the prediction problem's feature(s).
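Template metadata of this kind could be used to screen out templates whose constraints a data set violates. The following sketch assumes a dictionary-based metadata format; the keys (`techniques`, `max_features`, `target_types`, `handles_text`) and the `is_applicable` helper are invented for illustration.

```python
# Hypothetical metadata for one template.
template_metadata = {
    "techniques": ["feature_normalization", "dimension_reduction"],
    "max_features": 1000,            # constraint on data set dimensionality
    "target_types": {"numeric"},     # supported target characteristics
    "handles_text": False,           # constraint on feature characteristics
}

def is_applicable(metadata, dataset_profile):
    """Check a data set profile against the constraints in template metadata."""
    if dataset_profile["n_features"] > metadata["max_features"]:
        return False
    if dataset_profile["target_type"] not in metadata["target_types"]:
        return False
    if dataset_profile["has_text"] and not metadata["handles_text"]:
        return False
    return True

numeric_profile = {"n_features": 40, "target_type": "numeric", "has_text": False}
text_profile = {"n_features": 40, "target_type": "numeric", "has_text": True}

applicable_to_numeric = is_applicable(template_metadata, numeric_profile)
applicable_to_text = is_applicable(template_metadata, text_profile)
```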

In some embodiments, a template's metadata includes information relevant to estimating how well the corresponding modeling technique will work for a given data set. For example, a template's metadata may indicate how well the corresponding modeling technique is expected to perform on data sets having particular characteristics, including, without limitation, wide data sets, tall data sets, sparse data sets, dense data sets, data sets that do or do not include text, data sets that include variables of various data types (e.g., numerical, ordinal, categorical, interpreted (e.g., date, time, text), etc.), data sets that include variables with various statistical properties (e.g., statistical properties relating to the variable's missing values, cardinality, variance, etc.), etc. As another example, a template's metadata may indicate the expected performance of the corresponding modeling technique in terms of one or more performance metrics (e.g., objective functions).

In some embodiments, a template's metadata includes characterizations of the processing steps implemented by the corresponding modeling technique, including, without limitation, the processing steps' allowed data type(s), structure, and/or dimensionality.

In some embodiments, a template's metadata includes data indicative of the results (actual or expected) of applying the predictive modeling technique represented by the template to one or more prediction problems and/or data sets. The results of applying a predictive modeling technique to a prediction problem or data set may include, without limitation, the accuracy with which predictive models generated by the predictive modeling technique predict the target(s) of the prediction problem or data set, the rank of the accuracy of the predictive models generated by the predictive modeling technique (relative to other predictive modeling techniques) for the prediction problem or data set, a score representing the utility of using the predictive modeling technique to generate a predictive model for the prediction problem or data set (e.g., the value produced by the predictive model for an objective function), etc.

The data indicative of the results of applying a predictive modeling technique to a prediction problem or data set may be provided by the exploration engine 110 (e.g., based on the results of previous attempts to use the predictive modeling technique for the prediction problem or the data set), provided by a user (e.g., based on the user's expertise), and/or obtained from any other suitable source. In some embodiments, the exploration engine 110 updates such data based, at least in part, on the relationship between actual outcomes of instances of a prediction problem and the outcomes predicted by a predictive model generated via the predictive modeling technique.

In some embodiments, a template's metadata describes characteristics of the corresponding modeling technique relevant to estimating how efficiently the modeling technique will execute on a distributed computing infrastructure. For example, a template's metadata may indicate the processing resources needed to train and/or test the modeling technique on a data set of a given size, the effect on resource consumption of the number of cross-validation folds and the number of points searched in the hyper-parameter space, the intrinsic parallelizability of the processing steps performed by the modeling technique, etc.

In some embodiments, the library 130 of modeling techniques includes tools for assessing the similarities (or differences) between predictive modeling techniques. Such tools may express the similarity between two predictive modeling techniques as a score (e.g., on a predetermined scale), a classification (e.g., "highly similar", "somewhat similar", "somewhat dissimilar", "highly dissimilar"), a binary determination (e.g., "similar" or "not similar"), etc. Such tools may determine the similarity between two predictive modeling techniques based on the processing steps that are common to the modeling techniques, based on data indicative of the results of applying the two predictive modeling techniques to the same or similar prediction problems, etc. For example, given two predictive modeling techniques that have a large number (or high percentage) of their processing steps in common and/or yield similar results when applied to similar prediction problems, the tools may assign the modeling techniques a high similarity score or classify the modeling techniques as "highly similar".

In some embodiments, modeling techniques may be assigned to families of modeling techniques. The family classifications of the modeling techniques may be assigned by a user (e.g., based on intuition and experience), assigned by a machine-learning classifier (e.g., based on processing steps common to the modeling techniques, data indicative of the results of applying different modeling techniques to the same or similar problems, etc.), or obtained from another suitable source. The tools for assessing the similarities between predictive modeling techniques may rely on the family classifications to assess the similarity between two modeling techniques. In some embodiments, the tools may treat all modeling techniques in the same family as "similar" and treat any modeling techniques in different families as "dissimilar". In some embodiments, the family classifications of the modeling techniques may be just one factor in the tools' assessment of the similarity between modeling techniques.

In some embodiments, predictive modeling system 100 includes a library of prediction problems (not shown in FIG. 1). The library of prediction problems may include data indicative of the characteristics of prediction problems. In some embodiments, the data indicative of the characteristics of a prediction problem includes data indicative of the characteristics of a data set representing the prediction problem. Characteristics of a data set may include, without limitation, the data set's width, height, sparseness, or density; the number of targets and/or features in the data set; the data types of the data set's variables (e.g., numerical, ordinal, categorical, or interpreted (e.g., date, time, text, etc.)); the ranges of the data set's numerical variables; the number of classes for the data set's ordinal and categorical variables; etc.

In some embodiments, characteristics of a data set include statistical properties of the data set's variables, including, without limitation, the total number of observations; the number of unique values for each variable across observations; the number of missing values of each variable across observations; the presence and extent of outliers and inliers; the properties of the distribution of each variable's values or class membership; the cardinality of the variables; etc. In some embodiments, characteristics of a data set include relationships (e.g., statistical relationships) between the data set's variables, including, without limitation, the joint distributions of groups of variables; the variable importance of one or more features to one or more targets (e.g., the extent of correlation between feature and target variables); the statistical relationships between two or more features (e.g., the extent of multicollinearity between two features); etc.
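A few of the statistical properties listed above can be computed directly from the observations. The sketch below assumes a data set held as a list of dictionaries with `None` marking a missing value; the function name `profile` and the layout of the returned summary are hypothetical.

```python
from typing import Any, Dict, List, Optional


def profile(rows: List[Dict[str, Optional[Any]]]) -> Dict[str, Any]:
    """Summarize height, width, and per-variable unique and missing counts."""
    columns = sorted({name for row in rows for name in row})
    summary: Dict[str, Any] = {
        "n_observations": len(rows),  # height of the data set
        "n_variables": len(columns),  # width of the data set
        "variables": {},
    }
    for name in columns:
        values = [row.get(name) for row in rows]
        present = [v for v in values if v is not None]
        summary["variables"][name] = {
            "n_unique": len(set(present)),
            "n_missing": len(values) - len(present),
        }
    return summary


# Hypothetical three-observation data set with one missing value.
rows = [
    {"age": 31, "plan": "basic"},
    {"age": None, "plan": "premium"},
    {"age": 31, "plan": "basic"},
]
stats = profile(rows)
```

Properties such as outlier extent or multicollinearity would need additional statistics that this sketch leaves out.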

In some embodiments, the data indicative of the characteristics of a prediction problem includes data indicative of the subject matter of the prediction problem (e.g., finance, insurance, defense, e-commerce, retail, internet-based advertising, internet-based recommendation engines, etc.); the provenance of the variables (e.g., whether each variable was acquired directly from an automated instrument, from a human recording of an automated instrument, from human measurement, from a written human response, from a verbal human response, etc.); the existence and performance of known predictive modeling solutions for the prediction problem; etc.

In some embodiments, predictive modeling system 100 may support time-series prediction problems (e.g., one-dimensional or multi-dimensional time-series prediction problems). For a time-series prediction problem, the objective is generally to predict future values of the targets as a function of prior observations of all features, including the targets themselves. The data indicative of the characteristics of a prediction problem may accommodate time-series prediction problems by indicating whether the prediction problem is a time-series prediction problem and by identifying the time measurement variable in the data set corresponding to the time-series prediction problem.
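To illustrate predicting a future value from prior observations, the following sketch turns a one-dimensional series into (previous values, next value) pairs that an ordinary supervised learner could consume. The function name `make_lagged_rows` and the fixed-window framing are illustrative assumptions, not the system's actual data preparation.

```python
from typing import List, Sequence, Tuple


def make_lagged_rows(
    series: Sequence[float], n_lags: int
) -> List[Tuple[List[float], float]]:
    """Pair each value with the n_lags observations that precede it."""
    rows = []
    for t in range(n_lags, len(series)):
        history = list(series[t - n_lags:t])  # prior observations only
        rows.append((history, series[t]))     # the value at time t is the target
    return rows


# The series must already be ordered by its time measurement variable.
lagged = make_lagged_rows([1, 2, 3, 4, 5], n_lags=2)
```

A multi-dimensional problem would carry the lagged values of every feature in each history rather than the target alone.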

In some embodiments, the library of prediction problems includes tools for assessing the similarities (or differences) between prediction problems. Such tools may express the similarity between two prediction problems as a score (e.g., on a predetermined scale), a classification (e.g., "very similar", "somewhat similar", "somewhat dissimilar", "very dissimilar"), or a binary determination (e.g., "similar" or "dissimilar"). Such tools may determine the similarity between two prediction problems based on the data indicative of the characteristics of the prediction problems, based on data indicative of the results of applying the same or similar predictive modeling techniques to the prediction problems, etc. For example, given two prediction problems represented by data sets that have a large number (or high percentage) of characteristics in common and/or are amenable to the same or similar predictive modeling techniques, the tools may assign the prediction problems a high similarity score or classify them as "very similar".

FIG. 2 shows a block diagram of a modeling tool 200 suitable for building machine-executable templates that encode predictive modeling techniques and for integrating such templates into predictive modeling methods, in accordance with some embodiments. User interface 120 may provide an interface to modeling tool 200.

In the example of FIG. 2, a modeling method builder 210 builds a library 212 of modeling methods on top of the library 130 of modeling techniques. A modeling technique builder 220 builds the library 130 of modeling techniques on top of a library 232 of modeling tasks. A modeling method may correspond to the intuition and experience of one or more analysts regarding which modeling techniques work well in which circumstances, and/or may leverage the results of applying modeling techniques to previous prediction problems to guide exploration of the modeling search space for a prediction problem. A modeling technique may correspond to a step-by-step recipe for applying a specific modeling algorithm. A modeling task may correspond to a processing step within a modeling technique.

In some embodiments, a modeling technique may include a hierarchy of tasks. For example, a top-level "text mining" task may include sub-tasks for (a) creating a document-term matrix and (b) ranking terms and dropping unimportant terms. In turn, the "rank and drop terms" sub-task may include sub-tasks for (b.1) building a ranking model and (b.2) using the term ranks to drop columns from the document-term matrix. Such hierarchies may have arbitrary depth.
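The text-mining example can be represented as a simple recursive structure. The sketch below is illustrative only: the `Task` class and its methods are hypothetical names, and a real template would also carry executable code and metadata.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Task:
    """A modeling task that may contain sub-tasks to arbitrary depth."""
    name: str
    subtasks: List["Task"] = field(default_factory=list)

    def depth(self) -> int:
        """Number of levels in the hierarchy rooted at this task."""
        return 1 + max((t.depth() for t in self.subtasks), default=0)

    def leaves(self) -> List[str]:
        """Names of the leaf-level tasks, in left-to-right order."""
        if not self.subtasks:
            return [self.name]
        return [leaf for t in self.subtasks for leaf in t.leaves()]


text_mining = Task("text mining", [
    Task("create document-term matrix"),        # (a)
    Task("rank and drop terms", [               # (b)
        Task("build ranking model"),            # (b.1)
        Task("drop columns using term ranks"),  # (b.2)
    ]),
])
```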

In the example of FIG. 2, modeling tool 200 includes a modeling task builder 230, a modeling technique builder 220, and a modeling method builder 210. Each builder may include a tool or set of tools for encoding one of the modeling elements in a machine-executable format. Each builder may permit users to modify an existing modeling element or create a new modeling element. To construct a complete library of modeling elements across the modeling layers shown in FIG. 2, developers may employ a top-down, bottom-up, inside-out, outside-in, or combined strategy. From the perspective of logical dependency, however, leaf-level tasks are the smallest modeling elements, so FIG. 2 depicts task creation as the first step in the process of constructing machine-executable templates.

Each builder's user interface may be implemented using, without limitation, a collection of specialized routines in a standard programming language, a formal grammar designed specifically for the purpose of encoding that builder's elements, a rich user interface for abstractly specifying the desired execution flow, etc. However, the logical structure of the operations allowed at each layer is independent of any particular interface.

When creating modeling tasks at the leaf level of the hierarchy, modeling tool 200 may permit developers to incorporate software components from other sources. This capability leverages the installed base of software related to statistical learning and the accumulated knowledge of how to develop such software. This installed base covers scientific programming languages (e.g., Fortran), scientific routines written in general-purpose programming languages (e.g., C), scientific computing extensions to general-purpose programming languages (e.g., scikit-learn for Python), commercial statistical environments (e.g., SAS/STAT), and open source statistical environments (e.g., R). When used to incorporate the capabilities of such a software component, modeling task builder 230 may require a specification of the software component's inputs and outputs and/or a characterization of the types of operations the software component can perform. In some embodiments, modeling task builder 230 generates this metadata by inspecting a software component's source code signature, retrieving the software component's interface definition from a repository, probing the software component with a sequence of requests, or performing some other form of automated evaluation. In some embodiments, the developer manually supplies some or all of this metadata.

In some embodiments, modeling task builder 230 uses this metadata to create a "wrapper" that allows it to execute the incorporated software. Modeling task builder 230 may implement such wrappers using any mechanism for integrating software components, including, without limitation, compiling a component's source code into an internal executable, linking a component's object code into an internal executable, accessing a component through an emulator of the computing environment expected by the component's standalone executable, accessing a component's functions running as part of a software service on a local machine, accessing a component's functions running as part of a software service on a remote machine, accessing a component's functions through an intermediary software service running on a local or remote machine, etc. Whichever integration mechanism modeling task builder 230 uses, once the wrapper has been generated, modeling tool 200 may make software calls to the component as it would to any other routine.
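As a rough sketch of the idea, the following derives input metadata from a Python callable's signature with the standard `inspect` module and uses it to build a wrapper with a uniform calling convention. The names `describe` and `make_wrapper` and the dictionary-based convention are hypothetical, and an in-process Python function stands in for the external component.

```python
import inspect
from typing import Any, Callable, Dict


def describe(component: Callable) -> Dict[str, Any]:
    """Generate metadata by inspecting the component's signature."""
    signature = inspect.signature(component)
    return {"name": component.__name__, "inputs": list(signature.parameters)}


def make_wrapper(component: Callable) -> Callable[[Dict[str, Any]], Any]:
    """Wrap a component so it can be called like any other task."""
    metadata = describe(component)

    def task(inputs: Dict[str, Any]) -> Any:
        # For simplicity, every parameter is treated as required.
        missing = [n for n in metadata["inputs"] if n not in inputs]
        if missing:
            raise KeyError(f"missing inputs: {missing}")
        return component(**{n: inputs[n] for n in metadata["inputs"]})

    task.metadata = metadata  # keep the metadata alongside the wrapper
    return task


# Hypothetical external component.
def scale(values, factor):
    return [v * factor for v in values]


wrapped = make_wrapper(scale)
result = wrapped({"values": [1, 2], "factor": 3})
```

Components reached through a remote service or an emulator would need a different transport inside `task`, but the caller-facing convention could stay the same.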

In some embodiments, developers may use modeling task builder 230 to assemble leaf-level modeling tasks recursively into higher-level tasks. As indicated previously, there are many different ways to implement the user interface for specifying the arrangement of the task hierarchy. From a logical perspective, however, a task that is not at the leaf level may include a directed graph of sub-tasks. At each of the top and intermediate levels of this hierarchy, there may be one starting sub-task whose input comes from the parent task in the hierarchy (or from the parent modeling technique, at the top level of the hierarchy). There may also be one ending sub-task whose output goes to the parent task in the hierarchy (or to the parent modeling technique, at the top level of the hierarchy). Every other sub-task at a given level receives inputs from one or more previous sub-tasks and sends outputs to one or more subsequent sub-tasks.
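One way to picture a non-leaf task is as a small graph executor. This sketch assumes the graph is acyclic and uses the standard-library `graphlib.TopologicalSorter` (Python 3.9 or later) to order the sub-tasks. The function `run_graph` and the node names are hypothetical.

```python
from graphlib import TopologicalSorter
from typing import Any, Callable, Dict, List


def run_graph(
    functions: Dict[str, Callable[..., Any]],
    predecessors: Dict[str, List[str]],
    start: str,
    end: str,
    parent_input: Any,
) -> Any:
    """Run sub-tasks in dependency order and return the ending task's output."""
    results: Dict[str, Any] = {}
    for node in TopologicalSorter(predecessors).static_order():
        if node == start:
            inputs = [parent_input]  # the starting sub-task reads from the parent
        else:
            inputs = [results[p] for p in predecessors[node]]
        results[node] = functions[node](*inputs)
    return results[end]              # the ending sub-task writes to the parent


functions = {
    "load": lambda x: x,
    "double": lambda x: x * 2,
    "increment": lambda x: x + 1,
    "combine": lambda a, b: a + b,
}
predecessors = {
    "load": [],
    "double": ["load"],
    "increment": ["load"],
    "combine": ["double", "increment"],
}
output = run_graph(functions, predecessors, "load", "combine", parent_input=5)
```

Because a node's function may itself call `run_graph`, the same executor covers nested task hierarchies.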

In combination with the ability to incorporate arbitrary code in leaf-level tasks, propagating data according to the directed graph facilitates the implementation of arbitrary control flows within an intermediate-level task. In some embodiments, modeling tool 200 may provide additional built-in operations. For example, while it would be straightforward to implement any particular conditional logic as a leaf-level task coded in an external programming language, modeling task builder 230 may provide a built-in node or arc that performs conditional evaluations in a general fashion, directing some or all of the data from a node to different subsequent nodes based on the results of these evaluations. Similar alternatives exist for filtering the output from one node according to a rule or expression before propagating it as input to subsequent nodes, transforming the output from one node before propagating it as input to subsequent nodes, partitioning the output from one node according to a rule or expression before propagating each partition to a respective subsequent node, combining the outputs of multiple previous nodes according to a rule or formula before accepting them as input, iteratively applying a sub-graph of nodes' operations using one or more loop variables, etc.

In some embodiments, developers may use modeling technique builder 220 to assemble tasks from modeling task library 232 into modeling techniques. At least some of the modeling tasks in modeling task library 232 may correspond to the pre-processing steps, model-fitting steps, and/or post-processing steps of one or more modeling techniques. The development of tasks and techniques may follow a linear pattern, in which techniques are assembled after task library 232 is populated, or a more dynamic, circular pattern, in which tasks and techniques are assembled concurrently. A developer may be inspired to combine existing tasks into a new technique, realize that this technique requires new tasks, and iteratively refine until the new technique is complete. Alternatively, a developer may start with the conception of a new technique, perhaps from an academic publication, and begin building it from new tasks, while pulling existing tasks from modeling task library 232 where they provide suitable functionality. In all cases, the results of applying a modeling technique to reference data sets or in field tests will allow the developer or analyst to evaluate the performance of the technique. This evaluation may, in turn, result in changes anywhere in the hierarchy, from leaf-level modeling tasks to modeling techniques. By providing common modeling task and modeling technique libraries (232, 130) as well as high-productivity builder interfaces (210, 220, and 230), modeling tool 200 may allow developers to make changes rapidly and accurately, and to propagate such enhancements to other developers and users with access to the libraries (232, 130).

A modeling technique may provide a focal point for developers and analysts to conceptualize an entire predictive modeling procedure, with all the steps expected based on the best practices in the field. In some embodiments, modeling techniques encapsulate best practices from statistical learning disciplines. Moreover, modeling tool 200 can provide guidance in the development of high-quality techniques by, for example, providing a checklist of steps for the developer to consider, and by comparing the task graph for a new technique to the task graphs of existing techniques to, for example, detect missing tasks, detect additional tasks, and/or detect anomalous flows among steps.

In some embodiments, exploration engine 110 is used to build a predictive model for a data set 240 using the techniques in modeling technique library 130. Exploration engine 110 may prioritize the evaluation of the modeling techniques in modeling technique library 130 based on a prioritization scheme encoded by a modeling method selected from modeling method library 212. Examples of suitable prioritization schemes for exploring the modeling space are described in the next section. In the example of FIG. 2, the results of exploring the modeling space may be used to update the metadata associated with modeling tasks and techniques.

In some embodiments, unique identifiers (IDs) may be assigned to the modeling elements (e.g., techniques, tasks, and sub-tasks). The ID of a modeling element may be stored as metadata associated with the modeling element's template. In some embodiments, these modeling element IDs may be used to efficiently execute modeling techniques that share one or more modeling tasks or sub-tasks. Methods of efficiently executing modeling techniques are described in further detail below.
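One plausible use of such IDs, sketched here as an assumption rather than as the method described later in the document, is to cache a task's output under its ID and input so that techniques sharing the task do not recompute it. The `TaskCache` class and the ID `"task-17"` are hypothetical.

```python
from typing import Any, Callable, Dict, Hashable, Tuple


class TaskCache:
    """Reuse results of tasks that share both an ID and an input."""

    def __init__(self) -> None:
        self._results: Dict[Tuple[str, Hashable], Any] = {}
        self.executions = 0  # counts real (non-cached) executions

    def run(self, task_id: str, func: Callable[[Any], Any], data: Hashable) -> Any:
        key = (task_id, data)
        if key not in self._results:
            self._results[key] = func(data)
            self.executions += 1
        return self._results[key]


def center(values):
    """Shared pre-processing task: subtract the mean from every value."""
    mean = sum(values) / len(values)
    return tuple(v - mean for v in values)


cache = TaskCache()
data = (1.0, 2.0, 3.0)
# Two hypothetical techniques both begin with the task whose ID is "task-17".
first = cache.run("task-17", center, data)
second = cache.run("task-17", center, data)
```

Keying on the raw input requires hashable data; a production system would more likely key on a fingerprint of the data set.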

In the example of FIG. 2, the modeling results produced by exploration engine 110 are fed back to modeling task builder 230, modeling technique builder 220, and modeling method builder 210. The modeling builders may be adapted automatically (e.g., using a statistical learning algorithm) or manually (e.g., by a user) based on the modeling results. For example, modeling method builder 210 may be adapted based on patterns observed in the modeling results and/or based on a data analyst's experience. Similarly, the results of executing specific modeling techniques may inform automatic or manual adjustment of the default tuning parameter values for those techniques or for tasks within them. In some embodiments, the adaptation of the modeling builders may be semi-automated. For example, predictive modeling system 100 may flag potential improvements to methods, techniques, and/or tasks, and a user may decide whether to implement those potential improvements.

Modeling Space Exploration Engine

FIG. 3 is a flowchart of a method 300 for selecting a predictive model for a prediction problem, in accordance with some embodiments. In some embodiments, method 300 may correspond to a modeling method in modeling method library 212.

At step 310 of method 300, the suitabilities of a plurality of predictive modeling procedures (e.g., predictive modeling techniques) for a prediction problem are determined. A predictive modeling procedure's suitability for a prediction problem may be determined based on characteristics of the prediction problem, attributes of the modeling procedure, and/or other suitable information.

The "suitability" of a predictive modeling procedure for a prediction problem may include data indicative of the expected performance on the prediction problem of predictive models generated using the predictive modeling procedure. In some embodiments, a predictive model's expected performance on a prediction problem includes one or more expected scores (e.g., expected values of one or more objective functions) and/or one or more expected ranks (e.g., relative to other predictive models generated using other predictive modeling techniques).

Alternatively or in addition, the "suitability" of a predictive modeling procedure for a prediction problem may include data indicative of the extent to which the modeling procedure is expected to generate predictive models that provide adequate performance for the prediction problem. In some embodiments, a predictive modeling procedure's "suitability" data includes a classification of the modeling procedure's suitability. The classification scheme may have two classes (e.g., "suitable" or "not suitable") or more than two classes (e.g., "highly suitable", "moderately suitable", "moderately unsuitable", "highly unsuitable").

In some embodiments, exploration engine 110 determines the suitability of a predictive modeling procedure for a prediction problem based, at least in part, on one or more characteristics of the prediction problem, including (but not limited to) the characteristics described herein. As just one example, the suitability of a predictive modeling procedure for a prediction problem may be determined based on characteristics of the data set corresponding to the prediction problem, characteristics of the variables in that data set, relationships between the variables in the data set, and/or the subject matter of the prediction problem. Exploration engine 110 may include tools (e.g., statistical analysis tools) for analyzing the data sets associated with prediction problems to determine the characteristics of the prediction problems, the data sets, the data set variables, etc.

In some embodiments, exploration engine 110 determines the suitability of a predictive modeling procedure for a prediction problem based, at least in part, on one or more attributes of the predictive modeling procedure, including (but not limited to) the attributes of predictive modeling procedures described herein. As just one example, the suitability of a predictive modeling procedure for a prediction problem may be determined based on the data processing techniques performed by the predictive modeling procedure and/or the data processing constraints imposed by the predictive modeling procedure.

In some embodiments, determining the suitabilities of the predictive modeling procedures for the prediction problem includes eliminating at least one predictive modeling procedure from consideration for the prediction problem. The decision to eliminate a predictive modeling procedure from consideration may be referred to herein as "pruning" the eliminated modeling procedure and/or "pruning the search space". In some embodiments, the user can override the exploration engine's decision to prune a modeling procedure, such that the previously pruned modeling procedure remains eligible for further execution and/or evaluation during the exploration of the search space.

A predictive modeling procedure may be eliminated from consideration based on the results of applying one or more deductive rules to the attributes of the predictive modeling procedure and the characteristics of the prediction problem. The deductive rules may include, without limitation, the following: (1) if the prediction problem includes a categorical target variable, select only classification techniques for execution; (2) if the numeric features of the data set span vastly different magnitude ranges, select or prioritize techniques that provide normalization; (3) if the data set has text features, select or prioritize techniques that provide text mining; (4) if the data set has more features than observations, eliminate all techniques that require the number of observations to be greater than or equal to the number of features; (5) if the width of the data set exceeds a threshold width, select or prioritize techniques that provide dimension reduction; (6) if the data set is large and sparse (e.g., the size of the data set exceeds a threshold size and the sparseness of the data set exceeds a threshold sparseness), select or prioritize techniques that execute efficiently on sparse data structures; and/or any other rule for selecting, prioritizing, or eliminating a modeling technique, where the rule can be expressed in the form of an if-then statement. In some embodiments, deductive rules are chained, such that the sequential execution of several rules produces a conclusion. In some embodiments, the deductive rules may be updated, refined, or improved based on historical performance.
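As an illustration only, a few of the deductive rules above can be written as plain if-then checks. The sketch below assumes hypothetical `Problem` and `Procedure` records and a hypothetical `apply_rules` function; none of these names come from the described system, and only rules (1), (3), and (4) are shown.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Problem:
    """Hypothetical summary of a prediction problem and its data set."""
    categorical_target: bool
    n_features: int
    n_observations: int
    has_text: bool


@dataclass(frozen=True)
class Procedure:
    """Hypothetical attributes of a modeling procedure."""
    name: str
    is_classifier: bool
    needs_obs_ge_features: bool
    text_mining: bool


def apply_rules(problem, procedures):
    """Return (kept procedures in priority order, names of pruned procedures)."""
    kept, pruned = [], []
    for proc in procedures:
        # Rule (1): a categorical target admits classification techniques only.
        if problem.categorical_target and not proc.is_classifier:
            pruned.append(proc.name)
        # Rule (4): with more features than observations, eliminate techniques
        # that require at least as many observations as features.
        elif (problem.n_features > problem.n_observations
              and proc.needs_obs_ge_features):
            pruned.append(proc.name)
        else:
            kept.append(proc)
    # Rule (3): with text features, prioritize techniques that provide text
    # mining. The sort is stable, so the remaining order is otherwise preserved.
    if problem.has_text:
        kept.sort(key=lambda proc: not proc.text_mining)
    return kept, pruned
```

Because each rule is an independent condition, chaining amounts to running the checks in sequence; a pruned procedure could still be restored by a user override.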

In some embodiments, the search engine 110 determines the suitability of a predictive modeling procedure for a prediction problem based on the performance (expected or actual) of similar predictive modeling procedures on similar prediction problems. As a special case, the search engine 110 may determine the suitability of a predictive modeling procedure for a prediction problem based on the performance (expected or actual) of the same predictive modeling procedure on similar prediction problems.

As described above, the library of modeling techniques 130 may include tools for assessing the similarities between predictive modeling techniques, and the library of prediction problems may include tools for assessing the similarities between prediction problems. The search engine 110 may use these tools to identify predictive modeling procedures and prediction problems similar to the predictive modeling procedure and prediction problem at issue. For purposes of determining the suitability of a predictive modeling procedure for a prediction problem, the search engine 110 may select the M modeling procedures most similar to the modeling procedure at issue, select all modeling procedures exceeding a threshold similarity value with respect to the modeling procedure at issue, etc. Likewise, for purposes of determining the suitability of a predictive modeling procedure for a prediction problem, the search engine 110 may select the N prediction problems most similar to the prediction problem at issue, select all prediction problems exceeding a threshold similarity value with respect to the prediction problem at issue, etc.

Given a set of predictive modeling procedures and a set of prediction problems similar to the modeling procedure and prediction problem at issue, the search engine may combine the performances of the similar modeling procedures on the similar prediction problems to determine the expected suitability of the modeling procedure at issue for the prediction problem at issue. As described above, the template of a modeling procedure may include information relevant to estimating how well the corresponding modeling procedure will perform on a given data set. The search engine 110 may use the model performance metadata to determine the performance values (expected or actual) of the similar modeling procedures on the similar prediction problems. These performance values can then be combined to generate an estimate of the suitability of the modeling procedure at issue for the prediction problem at issue. For example, the search engine 110 may calculate the suitability of the modeling procedure at issue as a weighted sum of the performance values of the similar modeling procedures on the similar prediction problems.
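A minimal sketch of the weighted-sum estimate follows. The text does not say how the weights are chosen, so the sketch assumes each weight is the product of a procedure-similarity value and a problem-similarity value, and it normalizes by the total weight so the estimate stays on the same scale as the performance values. The function name `estimate_suitability` is hypothetical.

```python
def estimate_suitability(records):
    """Estimate suitability from (proc_similarity, problem_similarity, performance).

    Each record describes how one similar procedure performed on one similar
    problem. The weighting scheme is an assumption made for illustration.
    """
    weighted_total = 0.0
    weight_total = 0.0
    for proc_similarity, problem_similarity, performance in records:
        weight = proc_similarity * problem_similarity
        weighted_total += weight * performance
        weight_total += weight
    if weight_total == 0:
        raise ValueError("at least one record must have a non-zero weight")
    # Normalized weighted sum of the observed performance values.
    return weighted_total / weight_total
```

For instance, a record with similarities (1.0, 1.0) counts twice as much as a record with similarities (0.5, 1.0).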

In some embodiments, the search engine 110 determines the suitability of a predictive modeling procedure for a prediction problem based, at least in part, on the output of a "meta" machine-learning model, which may be trained to determine the suitability of a modeling procedure for a prediction problem based on the results of various modeling procedures (e.g., modeling procedures similar to the modeling procedure at issue) on other prediction problems (e.g., prediction problems similar to the prediction problem at issue). The machine-learning model for estimating the suitability of a predictive modeling procedure for a prediction problem may be referred to as a "meta" machine-learning model because it applies machine learning recursively to predict which techniques are most likely to succeed for the prediction problem at issue. The search engine 110 may therefore generate meta-predictions of the suitability of a modeling technique for a prediction problem by using a meta-machine-learning algorithm trained on the results of solving other prediction problems.

In some embodiments, the search engine 110 may determine the suitability of a predictive modeling procedure for a prediction problem based, at least in part, on user input (e.g., user input representing the intuition or experience of a data analyst regarding the suitability of the predictive modeling procedure).

Referring to FIG. 3, at step 320 of method 300, at least a subset of the predictive modeling procedures may be selected based on the suitability of the modeling procedures for the prediction problem. In embodiments where the modeling procedures have been assigned to suitability categories (e.g., "suitable" or "not suitable"; "highly suitable," "moderately suitable," "moderately unsuitable," or "highly unsuitable"; etc.), selecting a subset of the modeling procedures may comprise selecting the modeling procedures assigned to one or more suitability categories (e.g., all modeling procedures assigned to the "suitable" category; all modeling procedures not assigned to the "highly unsuitable" category; etc.).

In embodiments where the modeling procedures have been assigned suitability values, the search engine 110 may select a subset of the modeling procedures based on the suitability values. In some embodiments, the search engine 110 selects the modeling procedures with suitability scores above a threshold suitability score. The threshold suitability score may be provided by a user or determined by the search engine 110. In some embodiments, the search engine 110 may adjust the threshold suitability score to increase or decrease the number of modeling procedures selected for execution, depending on the amount of processing resources available for execution of the modeling procedures.

In some embodiments, the search engine 110 selects the modeling procedures with suitability scores within a specified range of the highest suitability score assigned to any of the modeling procedures for the prediction problem at issue. The range may be absolute (e.g., scores within S points of the highest score) or relative (e.g., scores within P% of the highest score). The range may be provided by a user or determined by the search engine 110. In some embodiments, the search engine 110 may adjust the range to increase or decrease the number of modeling procedures selected for execution, depending on the amount of processing resources available for execution of the modeling procedures.

In some embodiments, the search engine 110 selects a fraction of the modeling procedures having the highest suitability scores for the prediction problem at issue. Equivalently, the search engine 110 may select the fraction of the modeling procedures having the highest suitability ranks (e.g., in cases where suitability scores for the modeling procedures are not available, but an ordering (ranking) of the modeling procedures' suitabilities is available). The fraction may be provided by a user or determined by the search engine 110. In some embodiments, the search engine 110 may adjust the fraction to increase or decrease the number of modeling procedures selected for execution, depending on the amount of processing resources available for execution of the modeling procedures.
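The three score-based selection strategies described above (threshold, range of the best score, and top fraction) can be sketched as follows. The function names are hypothetical, scores are assumed to be positive numbers keyed by procedure name, and the relative margin is expressed as a fraction (0.10 for "within 10%") rather than a percentage.

```python
import math


def select_above_threshold(scores, threshold):
    """Keep procedures whose suitability score exceeds the threshold."""
    return [name for name, score in scores.items() if score > threshold]


def select_within_range(scores, margin, relative=False):
    """Keep procedures scoring within `margin` of the best score.

    With relative=False, `margin` is an absolute number of points.
    With relative=True, `margin` is a fraction of the best score.
    """
    best = max(scores.values())
    cutoff = best * (1 - margin) if relative else best - margin
    return [name for name, score in scores.items() if score >= cutoff]


def select_top_fraction(scores, fraction):
    """Keep the highest-scoring fraction of procedures (at least one)."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    count = max(1, math.ceil(fraction * len(ranked)))
    return ranked[:count]
```

Raising the threshold, narrowing the margin, or shrinking the fraction each reduces the number of procedures selected, which is how a selection could be tightened when fewer processing resources are available.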

In some embodiments, a user may select one or more modeling procedures to be executed. The user-selected procedures may be executed in addition to or in lieu of one or more modeling procedures selected by the search engine 110. Allowing users to select modeling procedures for execution may improve the performance of the predictive modeling system 100, particularly in scenarios where a data analyst's intuition and experience indicate that the modeling system 100 has not accurately estimated a modeling procedure's suitability for a prediction problem.

In some embodiments, the search engine 110 may control the granularity of the search space evaluation by selecting a modeling procedure P0 that is representative of (e.g., similar to) one or more other modeling procedures P1...PN, rather than selecting all of the modeling procedures P0...PN, even if the modeling procedures P0...PN are all determined to be suitable for the prediction problem at issue. In addition, the search engine 110 may treat the results of executing the selected modeling procedure P0 as being representative of the results of executing the modeling procedures P1...PN. This coarse-grained approach to evaluating the search space may conserve processing resources, particularly when applied during the earlier stages of the evaluation of the search space. If the search engine 110 later determines that modeling procedure P0 is among the most suitable modeling procedures for the prediction problem, a fine-grained evaluation of the relevant portion of the search space can then be performed by executing and evaluating the similar modeling procedures P1...PN.
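One way to picture the coarse-grained pass is a greedy grouping in which each procedure is either represented by an already chosen representative or becomes a representative itself. This is a sketch under assumptions: the grouping rule, the `similarity` callback, and the name `pick_representatives` are illustrative and are not taken from the described system.

```python
def pick_representatives(ranked, similarity, threshold):
    """Group procedures under representatives.

    ranked: procedure names, most suitable first.
    similarity: function (a, b) -> float in [0, 1] (assumed to be provided,
        e.g., by a similarity tool of a modeling-technique library).
    Returns a dict mapping each representative P0 to the procedures P1...PN
    it stands in for during the coarse-grained evaluation.
    """
    groups = {}
    for name in ranked:
        for representative in groups:
            if similarity(representative, name) >= threshold:
                groups[representative].append(name)
                break
        else:
            # No existing representative is similar enough.
            groups[name] = []
    return groups
```

Only the dictionary keys would be executed at first; if a representative later turns out to be among the best performers, the procedures listed under it are the natural candidates for the fine-grained follow-up.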

Referring to FIG. 3, at step 330 of method 300, a resource allocation schedule may be generated. The resource allocation schedule may allocate processing resources for the execution of the selected modeling procedures. In some embodiments, the resource allocation schedule allocates the processing resources to the modeling procedures based on the determined suitabilities of the modeling procedures for the prediction problem at issue. In some embodiments, the search engine 110 transmits the resource allocation schedule to one or more processing nodes with instructions for executing the selected modeling procedures according to the resource allocation schedule.

The allocated processing resources may include temporal resources (e.g., execution cycles of one or more processing nodes, execution time on one or more processing nodes, etc.), physical resources (e.g., a number of processing nodes, an amount of machine-readable storage (e.g., memory and/or secondary storage), etc.), and/or other allocable processing resources. In some embodiments, the allocated processing resources may be processing resources of a distributed computing system and/or a cloud-based computing system. In some embodiments, costs may be incurred when processing resources are allocated and/or used (e.g., fees may be collected by an operator of a data center in exchange for using the data center's resources).

As described above, the resource allocation schedule may allocate processing resources to modeling procedures based on the suitabilities of the modeling procedures for the prediction problem at issue. For example, the resource allocation schedule may allocate more processing resources to modeling procedures with higher predicted suitabilities for the prediction problem, and allocate fewer processing resources to modeling procedures with lower predicted suitabilities for the prediction problem, so that the more promising modeling procedures benefit from a greater share of the limited processing resources. As another example, the resource allocation schedule may allocate processing resources sufficient for processing larger data sets to modeling procedures with higher predicted suitabilities, and allocate processing resources sufficient for processing smaller data sets to modeling procedures with lower predicted suitabilities.

As another example, the resource allocation schedule may schedule execution of the modeling procedures with higher predicted suitabilities prior to execution of the modeling procedures with lower predicted suitabilities, which may also have the effect of allocating more processing resources to the more promising modeling procedures. In some embodiments, the results of executing the modeling procedures may be presented to the user via the user interface 120 as the results become available. In such embodiments, scheduling the modeling procedures with higher predicted suitabilities to execute before the modeling procedures with lower predicted suitabilities may provide the user with additional important information about the evaluation of the search space at an earlier phase of the evaluation, thereby facilitating rapid user-driven adjustments to the search plan. For example, based on the preliminary results, the user may determine that one or more modeling procedures that were expected to perform very well are actually performing very poorly. The user may investigate the cause of the poor performance and determine, for example, that the poor performance is caused by an error in the preparation of the data set. The user can then fix the error and restart execution of the modeling procedures that were affected by the error.

In some embodiments, the resource allocation schedule may allocate processing resources to modeling procedures based, at least in part, on the resource utilization characteristics and/or parallelism characteristics of the modeling procedures. As described above, the template corresponding to a modeling procedure may include metadata relevant to estimating how efficiently the modeling procedure will execute on a distributed computing infrastructure. In some embodiments, this metadata includes an indication of the modeling procedure's resource utilization characteristics (e.g., the processing resources needed to train and/or test the modeling procedure on a data set of a given size). In some embodiments, this metadata includes an indication of the modeling procedure's parallelism characteristics (e.g., the extent to which the modeling procedure can be executed in parallel on multiple processing nodes). Using the resource utilization characteristics and/or parallelism characteristics of the modeling procedures to determine the resource allocation schedule may facilitate efficient allocation of processing resources to the modeling procedures.

In some embodiments, the resource allocation schedule may allocate a specified amount of processing resources for the execution of the modeling procedures. The allocable amount of processing resources may be specified in a processing resource budget, which may be provided by a user or obtained from another suitable source. The processing resource budget may impose limits on the processing resources to be used for executing the modeling procedures (e.g., the amount of time to be used, the number of processing nodes to be used, the cost incurred for using a data center or cloud-based processing resources, etc.). In some embodiments, the processing resource budget may impose limits on the total processing resources to be used for the process of generating a predictive model for a specified prediction problem.
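The allocation ideas in the preceding paragraphs can be combined into a small sketch: a fixed budget is divided in proportion to predicted suitability, and the resulting schedule lists the most suitable procedures first. Proportional sharing is only one possible policy and is an assumption here, as are the function name `build_schedule` and the unit of the budget (for example, node-hours).

```python
def build_schedule(budget, suitability):
    """Split `budget` across procedures in proportion to predicted suitability.

    suitability: dict mapping procedure name -> non-negative suitability value.
    Returns a list of (name, share) pairs ordered for execution, with the most
    suitable procedure first. The shares sum to the budget.
    """
    total = sum(suitability.values())
    if total <= 0:
        raise ValueError("suitability values must have a positive sum")
    execution_order = sorted(suitability, key=suitability.get, reverse=True)
    return [(name, budget * suitability[name] / total) for name in execution_order]
```

A fuller version could also consult each procedure's resource utilization and parallelism metadata, for example to cap a share at what the procedure can actually use.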

Returning to FIG. 3, at step 340 of method 300, the results of executing the selected modeling procedures in accordance with the resource allocation schedule may be received. These results may include one or more predictive models generated by the executed modeling procedures. In some embodiments, the predictive models received at step 340 are fitted to the data set(s) associated with the prediction problem, because the execution of the modeling procedures may include fitting the predictive models to one or more data sets associated with the prediction problem. Fitting the predictive models to the prediction problem's data set(s) may include tuning one or more hyper-parameters of the predictive modeling procedure that generates the predictive model, tuning one or more parameters of the generated predictive model, and/or other suitable model-fitting steps.

In some embodiments, the results received at step 340 include evaluations (e.g., scores) of the models' performances on the prediction problem. These evaluations may be obtained by testing the predictive models on the data set(s) associated with the prediction problem. In some embodiments, testing a predictive model includes cross-validating the model using different folds of the training data set associated with the prediction problem. In some embodiments, the execution of the modeling procedures includes the testing of the generated models. In some embodiments, the testing of the generated models is performed separately from the execution of the modeling procedures.
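Cross-validation over folds can be sketched in a few lines. The helper names (`k_fold_indices`, `cross_validate`) and the `fit`/`score` callbacks are hypothetical; the folds here are contiguous and unshuffled for simplicity, which a real evaluation would usually not assume.

```python
def k_fold_indices(n_rows, k):
    """Split range(n_rows) into k contiguous folds of near-equal size."""
    sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cross_validate(rows, k, fit, score):
    """Average the held-out score over k folds.

    fit(train_rows) -> model; score(model, test_rows) -> float.
    """
    fold_scores = []
    for held_out in k_fold_indices(len(rows), k):
        held_out_set = set(held_out)
        # Train on every row outside the held-out fold, test on the fold.
        train = [row for i, row in enumerate(rows) if i not in held_out_set]
        test = [rows[i] for i in held_out]
        fold_scores.append(score(fit(train), test))
    return sum(fold_scores) / len(fold_scores)
```

Each row is held out exactly once, so every score is computed on data the corresponding model did not see during fitting.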

The models may be tested in accordance with suitable testing techniques and scored according to a suitable scoring metric (e.g., an objective function). Different scoring metrics may place different weights on different aspects of a predictive model's performance, including, without limitation, the model's accuracy (e.g., the rate at which the model correctly predicts the outcome of the prediction problem), false positive rate (e.g., the rate at which the model incorrectly predicts a "positive" outcome), false negative rate (e.g., the rate at which the model incorrectly predicts a "negative" outcome), positive predictive value, negative predictive value, sensitivity, specificity, etc. The user may select a standard scoring metric (e.g., goodness-of-fit, R-squared, etc.) from a set of options presented via the user interface 120, or specify a custom scoring metric (e.g., a custom objective function) via the user interface 120. The search engine 110 may use the user-selected or user-specified scoring metric to score the performance of the predictive models.
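For a binary prediction problem, the metrics listed above follow from the four confusion-matrix counts using their standard definitions. The sketch below assumes labels encoded as 1 (positive) and 0 (negative); the function name `binary_metrics` is hypothetical.

```python
def binary_metrics(y_true, y_pred):
    """Compute common scoring metrics from binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    def ratio(numerator, denominator):
        # Undefined ratios (empty denominators) are reported as NaN.
        return numerator / denominator if denominator else float("nan")

    return {
        "accuracy": ratio(tp + tn, len(pairs)),
        "false_positive_rate": ratio(fp, fp + tn),
        "false_negative_rate": ratio(fn, fn + tp),
        "positive_predictive_value": ratio(tp, tp + fp),
        "negative_predictive_value": ratio(tn, tn + fn),
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
    }
```

A custom objective function could then be any weighted combination of these values, for instance one that penalizes false negatives more heavily than false positives.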

Returning to FIG. 3, at step 350 of method 300, a predictive model may be selected for the prediction problem based on the evaluations (e.g., scores) of the generated predictive models. The space search engine 110 may use any suitable criteria to select the predictive model for the prediction problem. In some embodiments, the space search engine 110 may select the model with the highest score, or any model having a score that exceeds a threshold score, or any model having a score within a specified range of the highest score. In some embodiments, the predictive models' scores may not be the only factor considered by the space search engine 110 when selecting a predictive model for the prediction problem. Other factors considered by the space search engine may include, without limitation, the predictive model's complexity, the computational demands of the predictive model, etc.

In some embodiments, selecting the predictive model for the prediction problem may comprise iteratively selecting a subset of the predictive models and training the selected predictive models on larger or different portions of the data set. This iterative process may continue until a predictive model is selected for the prediction problem or until the processing resources budgeted for generating the predictive model are exhausted.

Selecting a subset of the predictive models may comprise selecting a fraction of the predictive models with the highest scores, selecting all models having scores that exceed a threshold score, selecting all models having scores within a specified range of the score of the highest-scoring model, or selecting any other suitable group of models. In some embodiments, selecting the subset of predictive models may be analogous to selecting a subset of predictive modeling procedures, as described above with reference to step 320 of method 300. Accordingly, the details of selecting a subset of predictive models are not discussed further here.

Training the selected predictive models may comprise generating a resource allocation schedule that allocates processing resources of the processing nodes for the training of the selected models. The allocation of processing resources may be determined based, at least in part, on the suitabilities of the modeling techniques used to generate the selected models, and/or on the selected models' scores for other samples of the data set. Training the selected predictive models may further comprise transmitting instructions to processing nodes to fit the selected predictive models to a specified portion of the data set, and receiving results of the training process, including fitted models and/or scores of the fitted models. In some embodiments, training the selected predictive models may be analogous to executing the selected predictive modeling procedures, as described above with reference to steps 320-330 of method 300. Accordingly, the details of training the selected predictive models are not discussed further here.

In some embodiments, steps 330 and 340 may be performed iteratively until a predictive model is selected for the prediction problem or until the processing resources budgeted for generating the predictive model are exhausted. At the end of each iteration, the suitabilities of the predictive modeling procedures for the prediction problem may be re-determined based, at least in part, on the results of executing the modeling procedures, and a new set of predictive modeling procedures may be selected for execution during the next iteration.

In some embodiments, the number of modeling procedures executed in an iteration of steps 330 and 340 tends to decrease as the number of iterations increases, and the amount of data used for training and/or testing the generated models tends to increase as the number of iterations increases. Thus, the earlier iterations may "cast a wide net" by executing a relatively large number of modeling procedures on relatively small data sets, and the later iterations may perform more rigorous testing of the most promising modeling procedures identified during the earlier iterations. Alternatively or in addition, the earlier iterations may implement a more coarse-grained evaluation of the search space, and the later iterations may implement more fine-grained evaluations of the portions of the search space determined to be most promising.
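The "wide net first, rigorous testing later" pattern can be illustrated with a loop that keeps a fraction of the candidates and doubles the sample size on each round. The text only describes the general tendency; the concrete halving and doubling rule used here resembles the successive-halving technique and is swapped in as an assumption, as are the names `iterative_search` and `evaluate`.

```python
def iterative_search(procedures, evaluate, n_rows, start_rows, keep_fraction=0.5):
    """Iteratively narrow the candidates while growing the sample size.

    evaluate(procedure, rows) -> score, where a higher score is better.
    Returns (best procedure, history of (rows used, candidates evaluated)).
    """
    candidates = list(procedures)
    rows = min(start_rows, n_rows)
    history = []
    while True:
        scores = {proc: evaluate(proc, rows) for proc in candidates}
        history.append((rows, list(candidates)))
        ranked = sorted(candidates, key=scores.get, reverse=True)
        # Stop when a single candidate remains or all data has been used.
        if len(ranked) == 1 or rows >= n_rows:
            return ranked[0], history
        keep = max(1, int(len(ranked) * keep_fraction))
        candidates = ranked[:keep]
        rows = min(n_rows, rows * 2)
```

A processing resource budget could be enforced by adding a third stopping condition that tracks the cost of each `evaluate` call.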

In some embodiments, method 300 includes one or more steps not illustrated in FIG. 3. Additional steps of method 300 may include, without limitation, processing a data set associated with the prediction problem, blending two or more predictive models to form a blended predictive model, and/or tuning the predictive model selected for the prediction problem. Some embodiments of these steps are described in further detail below.

Method 300 may include a step in which the data set associated with a prediction problem is processed. In some embodiments, processing a prediction problem's data set includes characterizing the data set. Characterizing the data set may include identifying potential problems with the data set, including, without limitation, identifying data leaks (e.g., scenarios in which the data set includes a feature that is strongly correlated with the target, but the value of the feature would not be available as input to the predictive model under the conditions imposed by the prediction problem), detecting missing observations, detecting missing variable values, identifying outlying variable values, and/or identifying variables that are likely to have significant predictive value ("important variables").

In some embodiments, processing the data set of a prediction problem includes applying feature engineering to the data set. Applying feature engineering to the data set may include combining two or more features and replacing the constituent features with the combined feature, extracting different aspects of date/time variables (e.g., temporal and seasonal information) into separate variables, normalizing variable values, and filling in missing variable values.

Method 300 may include a step in which two or more predictive models are blended to form a blended predictive model. The blending step may be performed iteratively in connection with executing the predictive modeling techniques and evaluating the generated predictive models. In some embodiments, the blending step may be performed in only some of the execution/evaluation iterations (e.g., in the later iterations, after multiple promising predictive models have been generated).

Two or more models may be blended by combining the outputs of the constituent models. In some embodiments, the blended model may comprise a weighted linear combination of the outputs of the constituent models. A blended predictive model may perform better than its constituent predictive models, particularly when the different constituent models are complementary. For example, a blended model may be expected to perform well when the constituent models tend to perform well on different portions of the prediction problem's data set, when blends of the models have performed well on other (e.g., similar) prediction problems, when the modeling techniques used to generate the models are dissimilar (e.g., one model is a linear model and the other is a tree model), etc. In some embodiments, the constituent models to be blended are identified by a user (e.g., based on the user's intuition and experience).
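
A weighted linear combination of constituent model outputs can be sketched as follows. The function name `blend`, the example outputs, and the equal weights are hypothetical and serve only to illustrate the arithmetic; how the weights are chosen is not specified here.

```python
from typing import List, Sequence


def blend(predictions: Sequence[Sequence[float]], weights: Sequence[float]) -> List[float]:
    """Return the weighted linear combination of the constituent models' outputs.

    predictions[m][i] is the output of constituent model m for observation i.
    """
    if len(predictions) != len(weights):
        raise ValueError("one weight per constituent model is required")
    n_obs = len(predictions[0])
    if any(len(p) != n_obs for p in predictions):
        raise ValueError("all models must score the same observations")
    return [
        sum(w * p[i] for w, p in zip(weights, predictions))
        for i in range(n_obs)
    ]


# Hypothetical outputs of a linear model and a tree model on three observations.
linear_out = [1.0, 2.0, 3.0]
tree_out = [3.0, 2.0, 1.0]
blended = blend([linear_out, tree_out], weights=[0.5, 0.5])
```

In this toy case the two models err in opposite directions, so the blend is flatter than either constituent, which is the kind of complementarity the paragraph above describes.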

Method 300 may include a step in which the predictive model selected for the prediction problem is tuned. In some cases, the deployment engine 140 provides the source code that implements the predictive model to the user, thereby enabling the user to tune the predictive model. However, disclosing a predictive model's source code may be undesirable in some cases (e.g., in cases where the predictive modeling technique or predictive model contains proprietary capabilities or information). To permit a user to tune a predictive model without exposing the model's source code, the deployment engine 140 may construct human-readable rules for tuning the model's parameters based on a representation (e.g., a mathematical representation) of the predictive model, and provide the human-readable rules to the user. The user can then use the human-readable rules to tune the model's parameters without accessing the model's source code. Thus, predictive modeling system 100 may support evaluation and tuning of proprietary predictive modeling techniques without exposing the source code for the proprietary modeling techniques to end users.

In some embodiments, the machine-executable templates corresponding to predictive modeling procedures may include efficiency-enhancing features to reduce redundant computation. These efficiency-enhancing features can be particularly valuable in cases where relatively small amounts of processing resources are budgeted for exploring the search space and generating the predictive model. As described above, the machine-executable templates may store unique IDs for the corresponding modeling elements (e.g., techniques, tasks, or sub-tasks). In addition, predictive modeling system 100 may assign unique IDs to data set samples S. In some embodiments, when a machine-executable template T is executed on a data set sample S, the template stores its modeling element ID, the data set/sample ID, and the results of executing the template on the data sample in a storage structure (e.g., a table, a cache, a hash, etc.) accessible to the other templates. When a template T is invoked on a data set sample S, the template checks the storage structure to determine whether the results of executing that template on that data set sample are already stored. If so, rather than reprocessing the data set sample to obtain the same results, the template simply retrieves the corresponding results from the storage structure, returns those results, and terminates. The storage structure may persist within individual iterations of the loop in which modeling procedures are executed, across multiple iterations of the procedure-execution loop, or across multiple search space explorations. The computational savings achieved through this efficiency-enhancing feature can be appreciable, because many tasks and sub-tasks are shared by different modeling techniques, and method 300 often involves executing different modeling techniques on the same data sets.
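
The lookup-before-execute behavior described above amounts to memoization keyed by the pair (modeling element ID, data sample ID). The following is a minimal sketch under that reading; the `Template` class, the `ResultCache` alias, and the ID strings are hypothetical names introduced for illustration and are not taken from this description.

```python
from typing import Any, Callable, Dict, Tuple

# Shared storage structure keyed by (modeling element ID, data sample ID).
ResultCache = Dict[Tuple[str, str], Any]


class Template:
    """Hypothetical machine-executable template wrapping one modeling element."""

    def __init__(self, element_id: str, run: Callable[[Any], Any], cache: ResultCache):
        self.element_id = element_id
        self._run = run
        self._cache = cache
        self.executions = 0  # counts real (non-cached) executions

    def __call__(self, sample_id: str, sample: Any) -> Any:
        key = (self.element_id, sample_id)
        if key in self._cache:
            # Result already stored: retrieve it and return without reprocessing.
            return self._cache[key]
        self.executions += 1
        result = self._run(sample)
        self._cache[key] = result  # make the result available to other templates
        return result


cache: ResultCache = {}
# The same task (same element ID) appearing inside two different modeling techniques.
scale_a = Template("task:scale", lambda xs: [x / max(xs) for x in xs], cache)
scale_b = Template("task:scale", lambda xs: [x / max(xs) for x in xs], cache)

first = scale_a("sample-S", [1.0, 2.0, 4.0])
second = scale_b("sample-S", [1.0, 2.0, 4.0])  # served from the shared cache
```

How long `cache` lives (one iteration, several iterations, or several explorations) is a deployment choice; in this sketch it is simply an in-memory dictionary.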

FIG. 4 shows a flowchart of a method 400 for selecting a predictive model for a prediction problem, in accordance with some embodiments. Method 300 may be embodied by the example of method 400.

In the example of FIG. 4, the space search engine 110 uses the modeling methodology library 212, the modeling technique library 130, and the modeling task library 232 to search the space of available modeling techniques for a solution to a predictive modeling problem. Initially, the user may select a modeling methodology from the library 212, or the space search engine 110 may automatically select a default modeling methodology. The available modeling methodologies may include, without limitation, selection of modeling techniques based on application of deductive rules, selection of modeling techniques based on the performance of similar modeling techniques on similar prediction problems, selection of modeling techniques based on the output of a meta machine-learning model, any combination of the foregoing modeling techniques, or other suitable modeling techniques.

At step 402 of method 400, the exploration engine 110 prompts the user to select the data set for the predictive modeling problem to be solved. The user can choose from previously loaded data sets or create a new data set, either from a file or from instructions for retrieving data from other information systems. In the case of files, the exploration engine 110 may support one or more formats including, without limitation, comma-separated values, tab-delimited, eXtensible Markup Language (XML), JavaScript Object Notation, native database files, etc. In the case of instructions, the user may specify the types of information systems, their network addresses, access credentials, references to the subsets of data within each system, and the rules for mapping the target data schemas into the desired data set schema. Such information systems may include, without limitation, databases, data warehouses, data integration services, distributed applications, Web services, etc.

At step 404 of method 400, the exploration engine 110 loads the data (e.g., by reading the specified file or accessing the specified information systems). Internally, the exploration engine 110 may construct a two-dimensional matrix with the features on one axis and the observations on the other. Conceptually, each column of the matrix may correspond to a variable, and each row of the matrix may correspond to an observation. The exploration engine 110 may attach relevant metadata to the variables, including metadata obtained from the original source (e.g., explicitly specified data types) and/or metadata generated during the loading process (e.g., the variable's apparent data type; whether the variable appears to be of an ordinal, cardinal, or interpreted type; etc.).

At step 406 of method 400, the exploration engine 110 prompts the user to identify which of the variables are targets and/or which are features. In some embodiments, the exploration engine 110 also prompts the user to identify the metric of model performance to be used for scoring the models (e.g., the metric of model performance to be optimized, in the sense of statistical optimization techniques, by the statistical learning algorithms implemented by the exploration engine 110).

At step 408 of method 400, the exploration engine 110 evaluates the data set. This evaluation may include calculating the characteristics of the data set. In some embodiments, this evaluation includes performing an analysis of the data set, which may help the user better understand the prediction problem. Such an analysis may include applying one or more algorithms to identify problematic variables (e.g., those with outliers or inliers), determining variable importance, determining variable effects, and identifying effect hotspots.

The analysis of the data set may be performed using any suitable techniques. Variable importance, which measures the degree of significance each feature has in predicting the target, may be analyzed using "gradient boosted trees", Breiman and Cutler's "Random Forest", "alternating conditional expectations", and/or other suitable techniques. Variable effects, which measure the direction and magnitude of the effects features have on the target, may be analyzed using "regularized regression", "logistic regression", and/or other suitable techniques. Effect hotspots, which identify the ranges over which features provide the most information in predicting the target, may be analyzed using the "RuleFit" algorithm and/or other suitable techniques.

In some embodiments, in addition to assessing the importance of the features contained in the original data set, the evaluation performed at step 408 of method 400 includes feature generation. Feature generation techniques may include interpreting the logical type of the data set's variables and applying various transformations to the variables to generate additional features. Examples of transformations include, without limitation, polynomial and logarithmic transformations for numeric features. For interpreted variables (e.g., date, time, currency, measurement units, percentages, and location coordinates), examples of transformations include, without limitation, parsing a date string into a continuous time variable, day of week, month, and season in order to test each aspect of the date for predictive power.
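
The transformations named above can be sketched with the Python standard library. This is an illustrative sketch: the function names, the `%Y-%m-%d` date format, the use of a day ordinal as the continuous time variable, and the northern-hemisphere season mapping are all assumptions made for the example.

```python
import math
from datetime import datetime


def season_of(month: int) -> str:
    """Map a month number to a season (northern-hemisphere convention, assumed)."""
    if month in (12, 1, 2):
        return "winter"
    if month in (3, 4, 5):
        return "spring"
    if month in (6, 7, 8):
        return "summer"
    return "autumn"


def date_features(date_string: str, fmt: str = "%Y-%m-%d") -> dict:
    """Parse a date string into separate candidate features."""
    dt = datetime.strptime(date_string, fmt)
    return {
        "day_ordinal": dt.toordinal(),  # continuous time variable, in days
        "day_of_week": dt.weekday(),    # 0 = Monday ... 6 = Sunday
        "month": dt.month,
        "season": season_of(dt.month),
    }


def numeric_features(x: float) -> dict:
    """Polynomial and logarithmic transformations of a positive numeric feature."""
    return {"x": x, "x_squared": x ** 2, "log_x": math.log(x)}


derived = date_features("2017-10-21")
```

Each derived column can then be tested separately for predictive power, since a trend over time and a weekly or seasonal cycle may relate to the target in different ways.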

The systematic transformation of numeric and/or interpreted variables, followed by their systematic testing with potential predictive modeling techniques, may enable predictive modeling system 100 to search more of the potential model space and achieve more precise predictions. For example, in the case of "date/time", separating temporal and seasonal information into separate features can be very beneficial because these separate features often exhibit very different relationships with the target variable.

Creating derived features by interpreting and transforming the original features can increase the dimensionality of the original data set. The predictive modeling system 100 may apply dimension reduction techniques, which may counter the increase in the data set's dimensionality. However, some modeling techniques are more sensitive to dimensionality than others. Also, different dimension reduction techniques tend to work better with some modeling techniques than with others. In some embodiments, predictive modeling system 100 maintains metadata describing these interactions. The system 100 may systematically evaluate various combinations of dimension reduction techniques and modeling techniques, prioritizing the combinations that the metadata indicate are most likely to succeed. The system 100 may further update this metadata based on the empirical performance of the combinations over time, and may incorporate new dimension reduction techniques as they are discovered.

At step 410 of method 400, predictive modeling system 100 presents the results of the data set evaluation (e.g., the results of the data set analysis, the characteristics of the data set, and/or the results of the data set transformations) to the user. In some embodiments, the results of the data set evaluation are presented via user interface 120 (e.g., using graphs and/or tables).

At step 412 of method 400, the user may refine the data set (e.g., based on the results of the data set evaluation). Such refinement may include selecting methods for handling missing values or outliers for one or more features, changing an interpreted variable's type, altering the transformations under consideration, eliminating features from consideration, directly editing particular values, transforming features using a function, combining the values of features using a formula, adding entirely new features to the data set, etc.

Steps 402-412 of method 400 may represent one embodiment of the step of processing the data set of a prediction problem, as described above in connection with some embodiments of method 300.

At step 414 of method 400, the search space engine 100 may load the available modeling techniques from the modeling technique library 130. The determination of which modeling techniques are available may depend on the selected modeling methodology. In some embodiments, the loading of the modeling techniques may occur in parallel with one or more of steps 402-412 of method 400.

At step 416 of method 400, the user instructs the exploration engine 110 to begin the search for modeling solutions in either manual mode or automatic mode. In automatic mode, the exploration engine 110 partitions the data set (step 418) using a default sampling algorithm and prioritizes the modeling techniques (step 420) using a default prioritization algorithm. Prioritizing the modeling techniques may include determining the suitability of the modeling techniques for the prediction problem, and selecting at least a subset of the modeling techniques for execution based on their determined suitability.

In manual mode, the exploration engine 110 suggests a data partitioning (step 422) and suggests a prioritization of the modeling techniques (step 424). The user may accept the suggested data partitioning or specify a custom partitioning (step 426). Likewise, the user may accept the suggested prioritization of modeling techniques or specify a custom prioritization of modeling techniques (step 428). In some embodiments, the user can modify one or more modeling techniques (e.g., using the modeling technique builder 220 and/or the modeling task builder 230) (step 430) before the exploration engine 110 begins executing the modeling techniques.

To facilitate cross-validation, predictive modeling system 100 may partition the data set (or suggest a partitioning of the data set) into K "folds". Cross-validation comprises fitting a predictive model to the partitioned data set K times, such that during each fitting, a different fold serves as the test set and the remaining folds serve as the training set. Cross-validation can generate useful information about how the accuracy of a predictive model varies with different training data. In steps 418 and 422, the predictive modeling system may partition the data set into K folds, where the number of folds K is a default parameter. In step 426, the user may change the number of folds K or cancel the use of cross-validation altogether.

To facilitate rigorous testing of the predictive models, predictive modeling system 100 may partition the data set (or suggest a partitioning of the data set) into a training set and a "holdout" test set. In some embodiments, the training set is further partitioned into K folds for cross-validation. The training set may then be used to train and evaluate the predictive models, but the holdout test set may be reserved strictly for testing the predictive models. In some embodiments, predictive modeling system 100 can strongly enforce the use of the holdout test set for testing (and not for training) by making the holdout test set inaccessible until a user with the designated authority and/or credentials releases it. In steps 418 and 422, predictive modeling system 100 may partition the data set such that a default percentage of the data set is reserved for the holdout set. In step 426, the user may change the percentage of the data set reserved for the holdout set, or cancel the use of a holdout set altogether.
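
The two partitioning schemes described above (a holdout set, then K cross-validation folds within the training set) can be sketched as follows. The function names, the default of K = 5, the default holdout fraction of 0.2, and the fixed shuffle seed are assumptions for the example, not defaults stated in this description.

```python
import random
from typing import Iterator, List, Tuple


def partition(n_obs: int, k: int = 5, holdout_fraction: float = 0.2,
              seed: int = 0) -> Tuple[List[int], List[List[int]]]:
    """Split observation indices into a holdout set and K training folds."""
    indices = list(range(n_obs))
    random.Random(seed).shuffle(indices)
    n_holdout = int(round(n_obs * holdout_fraction))
    holdout = indices[:n_holdout]          # reserved strictly for final testing
    training = indices[n_holdout:]
    folds = [training[i::k] for i in range(k)]
    return holdout, folds


def cv_splits(folds: List[List[int]]) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train, test) index lists; each fold serves as the test set once."""
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


holdout, folds = partition(n_obs=100)
splits = list(cv_splits(folds))
```

Passing `holdout_fraction=0.0` corresponds to cancelling the holdout set; access control over the holdout indices is outside the scope of this sketch.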

In some embodiments, predictive modeling system 100 partitions the data set to facilitate efficient use of computing resources during the evaluation of the modeling search space. For example, predictive modeling system 100 may partition the cross-validation folds of the data set into smaller samples. Reducing the size of the data samples to which the predictive models are fitted may reduce the amount of computing resources needed to evaluate the relative performance of different modeling techniques. In some embodiments, the smaller samples may be generated by taking random samples of a fold's data. Likewise, reducing the size of the data samples to which the predictive models are fitted may reduce the amount of computing resources needed to tune the parameters of a predictive model or the hyper-parameters of a modeling technique. Hyper-parameters include variable settings for a modeling technique that can affect the speed, efficiency, and/or accuracy of the model fitting process. Examples of hyper-parameters include, without limitation, the penalty parameters of an elastic-net model, the number of trees in a gradient boosted trees model, and the number of neighbors in a nearest neighbors model.

In steps 432-458 of method 400, the selected modeling techniques may be executed using the partitioned data to evaluate the search space. These steps are described in more detail below. For convenience, some aspects of the evaluation of the search space that relate to data partitioning are described in the following paragraphs.

Tuning hyper-parameters using sample data that includes the test set of a cross-validation fold can lead to model over-fitting, thereby making comparisons of different models' performance unreliable. Using a "specific approach" can help avoid this problem, and can provide several other important advantages. Accordingly, some embodiments of the exploration engine 110 implement a "nested cross-validation" technique, whereby two loops of k-fold cross-validation are applied. The outer loop provides a test set both for comparing a given model to other models and for calibrating each model's predictions on future samples. The inner loop provides both a test set for tuning the hyper-parameters of the given model and a training set for derived features.

Moreover, the cross-validation predictions produced in the inner loop may facilitate blending techniques that combine multiple different models. In some embodiments, the inputs to a blender are out-of-sample predictions from the models. Using in-sample predictions from the models could result in over-fitting if used with some blending algorithms. Without a well-defined process for consistently applying nested cross-validation, even the most experienced users can omit steps or implement them incorrectly. Thus, the application of a double loop of k-fold cross-validation may allow predictive modeling system 100 to simultaneously achieve the following five important goals: (1) tuning complex models with many hyper-parameters, (2) developing informative derived features, (3) tuning a blend of two or more models, (4) calibrating the predictions of single and/or blended models, and (5) maintaining a pristine, untouched test set that allows an accurate comparison of different models.
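
The double loop can be sketched with a deliberately trivial model so that the control flow stays visible. Everything in this sketch is an assumption made for illustration: the toy "shrunken mean" model, its single hyper-parameter `shrink`, mean squared error as the metric, and the fold counts. Blending and calibration, which the text also attributes to the nested scheme, are omitted.

```python
from statistics import mean


def splits(items, k):
    """Yield (train, test) lists for k-fold cross-validation over `items`."""
    folds = [items[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test


def fit(train, shrink):
    """Toy model: predict a shrunken mean of the training targets."""
    return (1.0 - shrink) * mean(train)


def mse(prediction, test):
    return mean((y - prediction) ** 2 for y in test)


def nested_cv(targets, grid, outer_k=3, inner_k=2):
    """Return the outer-loop score of a model tuned only inside the inner loop."""
    outer_scores = []
    for outer_train, outer_test in splits(targets, outer_k):
        # Inner loop: choose the hyper-parameter using outer-training data only.
        def inner_score(shrink):
            return mean(mse(fit(tr, shrink), te)
                        for tr, te in splits(outer_train, inner_k))

        best = min(grid, key=inner_score)
        # Outer loop: score the tuned model on data never used for tuning.
        outer_scores.append(mse(fit(outer_train, best), outer_test))
    return mean(outer_scores)


targets = [float(v) for v in range(1, 13)]
score = nested_cv(targets, grid=[0.0, 0.5])
```

Because the outer test fold never influences the choice of `shrink`, the returned score can be compared across modeling techniques without the optimistic bias that tuning on the test data would introduce.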

At step 432 of method 400, the exploration engine 110 generates a resource allocation schedule for the execution of an initial set of the selected modeling techniques. The allocation of resources represented by the resource allocation schedule may be determined based on the prioritization of the modeling techniques, the partitioned data samples, and the available computational resources. In some embodiments, the exploration engine 110 allocates resources to the selected modeling techniques greedily (e.g., assigning computational resources in turn to the highest-priority modeling technique that has not yet been executed).
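
One way to read the greedy policy is sketched below: the highest-priority technique that has not yet been scheduled is handed to whichever worker becomes free first. The tuple layout, the technique names, the cost estimates, and the notion of identical "workers" are hypothetical simplifications for the example.

```python
import heapq
from typing import Dict, List, Tuple


def greedy_schedule(techniques: List[Tuple[float, str, float]],
                    n_workers: int) -> Dict[int, List[str]]:
    """Assign techniques to workers greedily, in descending priority order.

    Each technique is (priority, name, estimated_cost); higher priority runs first.
    """
    ordered = sorted(techniques, key=lambda t: -t[0])
    # Min-heap of (time at which the worker becomes free, worker id).
    workers = [(0.0, w) for w in range(n_workers)]
    heapq.heapify(workers)
    schedule: Dict[int, List[str]] = {w: [] for w in range(n_workers)}
    for _priority, name, cost in ordered:
        free_at, w = heapq.heappop(workers)
        schedule[w].append(name)
        heapq.heappush(workers, (free_at + cost, w))
    return schedule


techniques = [
    (0.9, "gbm", 4.0),
    (0.7, "elastic_net", 1.0),
    (0.5, "knn", 2.0),
    (0.2, "svm", 1.0),
]
plan = greedy_schedule(techniques, n_workers=2)
```

With these made-up costs, the expensive top-priority technique occupies one worker while the other worker runs the remaining three in priority order.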

At step 434 of method 400, the exploration engine 110 initiates execution of the modeling techniques in accordance with the resource allocation schedule. In some embodiments, execution of a set of modeling techniques may comprise training one or more models on a same data sample extracted from the data set.

At step 436 of method 400, the exploration engine 110 monitors the status of execution of the modeling techniques. When a modeling technique has finished executing, the exploration engine 110 collects the results (step 438), which may include the fitted model and/or metrics of model fit for the corresponding data sample. Such metrics may include any metric that can be extracted from the underlying software components that perform the fitting, including, without limitation, the Gini coefficient, r-squared, residual mean squared error, any variations thereof, etc.

At step 440 of method 400, the exploration engine 110 eliminates the worst-performing modeling techniques from consideration (e.g., based on the performance of the models they produced according to the model fit metrics). The exploration engine 110 may determine which modeling techniques to eliminate using any suitable technique, including, without limitation, eliminating those that do not produce models that meet a minimum threshold value of a model fit metric, eliminating all modeling techniques except those that have produced models currently in the top fraction of all models produced, or eliminating any modeling techniques that have not produced models within a certain range of the top models. In some embodiments, different procedures may be used to eliminate modeling techniques at different stages of the evaluation. In some embodiments, users may be permitted to specify different elimination techniques for different modeling problems. In some embodiments, users may be permitted to build and use custom elimination techniques. In some embodiments, meta-statistical-learning techniques may be used to choose among elimination techniques and/or to adjust the parameters of those techniques.
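
The three elimination rules listed above can be sketched as filters over a table of scores. The sketch assumes one score per modeling technique (its best model so far) and a metric where higher is better; the function names, technique names, scores, and cut-off values are hypothetical.

```python
from typing import Dict


def keep_above_threshold(scores: Dict[str, float], minimum: float) -> Dict[str, float]:
    """Eliminate techniques whose models miss a minimum metric value."""
    return {t: s for t, s in scores.items() if s >= minimum}


def keep_top_fraction(scores: Dict[str, float], fraction: float) -> Dict[str, float]:
    """Keep only techniques ranked in the top fraction (at least one survives)."""
    n_keep = max(1, int(len(scores) * fraction))
    ranked = sorted(scores, key=scores.get, reverse=True)
    return {t: scores[t] for t in ranked[:n_keep]}


def keep_within_range_of_best(scores: Dict[str, float], tolerance: float) -> Dict[str, float]:
    """Keep techniques whose score is within `tolerance` of the best score."""
    best = max(scores.values())
    return {t: s for t, s in scores.items() if best - s <= tolerance}


# Hypothetical scores, one per technique, higher is better.
scores = {"gbm": 0.90, "elastic_net": 0.85, "knn": 0.70, "svm": 0.40}
by_threshold = keep_above_threshold(scores, minimum=0.5)
by_fraction = keep_top_fraction(scores, fraction=0.5)
by_range = keep_within_range_of_best(scores, tolerance=0.06)
```

Because each rule has the same input and output shape, a different rule (or a user-supplied one) could be substituted at different stages of the evaluation.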

As the search engine 110 calculates model performance and eliminates modeling techniques from consideration, the predictive modeling system 100 may present the progress of the search space evaluation to the user through the user interface 120 (step 442). In some embodiments, at step 444, the search engine 110 permits the user to modify the process of evaluating the search space based on the progress of the evaluation, the user's expert knowledge, and/or other suitable information. If the user specifies a modification to the search space evaluation process, the search engine 110 reallocates processing resources accordingly (e.g., determines which jobs are affected and either moves them within the scheduling queue or deletes them from the queue). Other jobs continue processing as before.

The user may modify the search space evaluation process in many different ways. For example, the user may reduce the priority of some modeling techniques or eliminate some modeling techniques from consideration, even though the models they produced performed well on the selected metric. As another example, the user may increase the priority of some modeling techniques or select some modeling techniques for consideration, even though the models they produced performed poorly. As another example, the user may prioritize the evaluation of specified models or the execution of specified modeling techniques against additional data samples. As another example, the user may modify one or more modeling techniques and select the modified techniques for consideration. As another example, the user may change the features used to train the modeling techniques or fit the models (e.g., by adding features, removing features, or selecting different features). Such a change may be useful if the results indicate that the feature magnitudes require normalization or that some of the features exhibit "missing data".

In some embodiments, steps 432-444 may be performed iteratively. Modeling techniques that are not eliminated (e.g., by the system at step 440 or by the user at step 444) survive to another iteration. Based on the performance of the models generated in the previous iteration (or iterations), the search engine 110 adjusts the priorities of the corresponding modeling techniques and allocates processing resources to the modeling techniques accordingly. As computational resources become available, the engine uses them to launch model-technique-execution jobs based on the updated priorities.
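One way to picture the priority-driven allocation in this loop is a proportional split of workers among the surviving techniques. The sketch below is an assumption for illustration: the text does not say how priorities map to resources, and `allocate_workers`, the use of the prior metric value as the priority, and the rounding rule are all hypothetical.

```python
def allocate_workers(scores, eliminated, total_workers):
    """Split workers among surviving techniques in proportion to prior scores.

    Assumes every surviving score is positive and higher is better.
    """
    survivors = {t: s for t, s in scores.items() if t not in eliminated}
    total = sum(survivors.values())
    # Floor of each proportional share.
    alloc = {t: int(total_workers * s / total) for t, s in survivors.items()}
    # Hand workers lost to rounding to the highest-priority techniques.
    leftover = total_workers - sum(alloc.values())
    for t in sorted(survivors, key=survivors.get, reverse=True)[:leftover]:
        alloc[t] += 1
    return alloc

# Hypothetical scores from the previous iteration; "glm" was eliminated.
allocation = allocate_workers({"gbm": 0.6, "svm": 0.3, "glm": 0.1}, {"glm"}, 10)
```

Calling the function again after each iteration, with updated scores and an updated eliminated set, mirrors the repeated pass through steps 432-444.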

In some embodiments, at step 432, the search engine 110 may "blend" multiple models using different mathematical combinations to create new models (e.g., using stepwise selection of models to include in the blender). In some embodiments, the predictive modeling system 100 provides a modular framework that allows users to plug in their own automatic blending techniques. In some embodiments, the predictive modeling system 100 allows users to manually specify different model blends.
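As a rough illustration of stepwise selection for a blender, the sketch below greedily adds the candidate model whose inclusion most reduces the error of an equal-weight average. It is a sketch under stated assumptions: the equal-weight average, the mean-squared-error criterion, and the names `stepwise_blend`, `candidates`, and `actual` are hypothetical, and the predictions are presumed to come from validation data rather than the holdout set.

```python
def mse(predicted, actual):
    """Mean squared error between two equal-length sequences."""
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)

def average(prediction_lists):
    """Element-wise mean of several prediction vectors."""
    return [sum(values) / len(values) for values in zip(*prediction_lists)]

def stepwise_blend(candidates, actual):
    """Greedy forward selection of models for an equal-weight blend."""
    chosen, best_err = [], float("inf")
    while True:
        best_name = None
        for name in candidates:
            if name in chosen:
                continue
            trial = average([candidates[n] for n in chosen + [name]])
            err = mse(trial, actual)
            if err < best_err:  # accept only strict improvements
                best_err, best_name = err, name
        if best_name is None:  # no remaining candidate improves the blend
            return chosen, best_err
        chosen.append(best_name)

# Hypothetical validation predictions from three candidate models.
actual = [1.0, 2.0, 3.0]
candidates = {"a": [2.0, 3.0, 4.0], "b": [0.0, 1.0, 2.0], "c": [5.0, 5.0, 5.0]}
members, error = stepwise_blend(candidates, actual)
```

Here models "a" and "b" err in opposite directions, so their average beats either one alone, which is the complementarity that makes blending useful.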

In some embodiments, the predictive modeling system 100 may offer one or more advantages in developing blended predictive models. First, blending may work better when a large variety of candidate models is available to blend. Moreover, blending may work better when the differences between candidate models correspond not simply to minor variations in algorithms but to major differences in approach, such as those among linear models, tree-based models, support vector machines, and nearest-neighbor classification. The predictive modeling system 100 may deliver a substantial head start by automatically producing a wide variety of models and maintaining metadata describing how the candidate models differ. The predictive modeling system 100 may also provide a framework that allows any model to be incorporated into a blended model, for example, by automatically normalizing the scale of variables across the candidate models. This framework may allow users to easily add their own customized or independently generated models to the automatically generated models, further increasing variety.

In addition to increasing the variety of candidate models available for blending, the predictive modeling system 100 also provides a number of user interface features and analytic features that may result in superior blending. First, the user interface 120 may provide an interactive model comparison, including several different alternative measures of candidate model fit and graphics such as dual lift charts, so that users can easily identify accurate and complementary models to blend. Second, the modeling system 100 gives the user the option of choosing specific candidate models and blending techniques, or of automatically fitting some or all of the blending techniques in the modeling technique library using some or all of the candidate models. A nested cross-validation framework then enforces the condition that the data used to rate each blended model is not used in tuning the blender itself or in tuning the hyper-parameters of its component models. This discipline may give the user a more accurate comparison of alternative blender performance. In some embodiments, the modeling system 100 implements the processing of a blended model in parallel, such that the computation time for the blended model approaches the computation time of its slowest component model.

Returning to FIG. 4, at step 446 of method 400, the user interface 120 presents the final results to the user. Based on this presentation, the user may refine the data set (e.g., by returning to step 412), adjust the allocation of resources to executing modeling techniques (e.g., by returning to step 444), modify one or more of the modeling techniques to improve accuracy (e.g., by returning to step 430), alter the data set (e.g., by returning to step 402), and so on.

At step 448 of method 400, rather than restarting the search space evaluation or a portion thereof, the user may select one or more top predictive model candidates. At step 450, the predictive modeling system 100 may present the results of the holdout test for the selected predictive model candidate(s). The holdout test results may provide a final gauge of how these candidates compare. In some embodiments, only users with adequate privileges may release the holdout test results. Preventing the release of the holdout test results until the candidate predictive models are selected may facilitate an unbiased evaluation of performance. However, the search engine 110 may actually calculate the holdout test results during the modeling job execution process (e.g., steps 432-444), as long as the results remain hidden until after the candidate predictive models are selected.
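The rule that holdout results may be computed early but stay hidden until a candidate is selected can be sketched as a small gatekeeper object. This is a hypothetical illustration: the class `HoldoutVault`, its method names, and the boolean privilege flag are assumptions, not structures described in the text.

```python
class HoldoutVault:
    """Stores holdout scores and releases them only after selection."""

    def __init__(self):
        self._scores = {}       # model id -> holdout score (kept hidden)
        self._selected = set()  # model ids chosen as final candidates

    def record(self, model_id, score):
        """May be called during job execution; nothing is shown to users."""
        self._scores[model_id] = score

    def select(self, model_id):
        """Marks a model as a selected candidate (step 448 in the text)."""
        self._selected.add(model_id)

    def release(self, model_id, user_may_release):
        """Returns the score only for a privileged user and a selected model."""
        if not user_may_release:
            raise PermissionError("user lacks the privilege to release results")
        if model_id not in self._selected:
            raise PermissionError("model has not been selected yet")
        return self._scores[model_id]

vault = HoldoutVault()
vault.record("gbm-7", 0.81)  # hypothetical model id and score
vault.select("gbm-7")
released = vault.release("gbm-7", user_may_release=True)
```

Keeping both checks inside `release` means a score computed during steps 432-444 cannot influence model selection by accident.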

User Interface

Returning to FIG. 1, the user interface 120 may provide tools for monitoring and/or guiding the search of the predictive modeling space. These tools may provide insight into a prediction problem's data set (e.g., by highlighting problematic variables in the data set, identifying relationships between variables in the data set, etc.) and/or insight into the results of the search. In some embodiments, data analysts may use the interface to guide the search, for example, by specifying the metrics used to evaluate and compare modeling solutions, by specifying the criteria for recognizing a suitable modeling solution, and so on. Thus, the user interface may be used by analysts to improve their own productivity and/or to improve the performance of the search engine 110. In some embodiments, the user interface 120 presents the results of the search in real time and permits users to guide the search in real time (e.g., to adjust the scope of the search or the allocation of resources among the evaluations of different modeling solutions). In some embodiments, the user interface 120 provides tools for coordinating the efforts of multiple data analysts working on the same prediction problem and/or related prediction problems.

In some embodiments, the user interface 120 provides tools for developing machine-executable templates for the library 130 of modeling techniques. System users may use these tools to modify existing templates, to create new templates, or to remove templates from the library 130. In this way, system users may update the library 130 to reflect advances in predictive modeling research and/or to include proprietary predictive modeling techniques.

The user interface 120 may include a variety of interface components that allow users to manage multiple modeling projects within an organization, create and modify elements of the modeling methodology hierarchy, conduct comprehensive searches for accurate predictive models, gain insights into the data set and model results, and/or deploy completed models to produce predictions on new data.

In some embodiments, the user interface 120 distinguishes between four types of users: administrators, technique developers, model builders, and observers. Administrators may control the allocation of human and computing resources to projects. Technique developers may create and modify modeling techniques and their component tasks. Model builders focus primarily on searching for good models, though they may also make minor adjustments to techniques and tasks. Observers may view certain aspects of project progress and modeling results, but may be prohibited from making any changes to the data or initiating any model building. An individual may fill more than one role on a specific project or across multiple projects.

Users acting as administrators may access the project management components of the user interface 120 to set project parameters, assign project responsibilities to users, and allocate computing resources to projects. In some embodiments, administrators may use the project management components to organize multiple projects into groups or hierarchies. All projects within a group may inherit the group's settings. In a hierarchy, all children of a project may inherit the project's settings. In some embodiments, users with sufficient permissions may override inherited settings. In some embodiments, users with sufficient permissions may further divide settings into different sections so that only users with the corresponding permissions may alter them. In some cases, administrators may permit access to certain resources orthogonally to the organization of projects. For example, certain techniques and tasks may be made available to every project unless explicitly prohibited. Others may be prohibited unless explicitly allowed. Moreover, some resources may be allocated on a per-user basis, so that a project can access the resources only if a user who possesses those rights is assigned to that particular project.
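Settings inheritance in a project hierarchy can be sketched as a lookup that walks from a project up through its ancestors until it finds a value. The `Project` class, its attribute names, and the example setting keys below are hypothetical and serve only to illustrate the inherit-then-override behavior described above.

```python
class Project:
    """A project node that inherits settings from its parent unless overridden."""

    def __init__(self, name, parent=None, **own_settings):
        self.name = name
        self.parent = parent
        self.own_settings = dict(own_settings)

    def setting(self, key):
        """Return the nearest value for key, searching this project first."""
        node = self
        while node is not None:
            if key in node.own_settings:
                return node.own_settings[key]
            node = node.parent
        raise KeyError(key)

# Hypothetical hierarchy: a root project, a child, and a grandchild.
root = Project("root", max_workers=8, metric="gini")
child = Project("pricing", parent=root, metric="rmse")  # overrides metric only
grandchild = Project("pricing-eu", parent=child)
```

In this sketch an override is simply a value stored on the child; a permission check before writing to `own_settings` would model the "sufficient permissions" condition.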

In user management, administrators may control the group of all users admitted to the system, their permitted roles, and their system-level permissions. In some embodiments, administrators may add users to the system by adding them to a corresponding group and issuing them some form of access credentials. In some embodiments, the user interface 120 may support different kinds of credentials, including, without limitation, username plus password, unified authorization frameworks (e.g., OAuth), hardware tokens (e.g., smart cards), and the like.

Once users are admitted, an administrator may specify that certain users have default roles that they assume for any project. For example, a particular user may be designated as an observer unless specifically authorized for another role by an administrator for a particular project. Another user may be provisioned as a technique developer for all projects unless specifically excluded by an administrator, while another may be provisioned as a technique developer for only a particular group of projects or branch of the project hierarchy. In addition to default roles, administrators may further assign users more specific permissions at the system level. For example, some administrators may be able to grant access to certain types of computing resources; some technique developers and model builders may be able to access certain features within the builders; and some model builders may be authorized to start new projects, consume more than a given level of computational resources, or invite new users to projects that they do not own.

In some embodiments, administrators may assign access, permissions, and responsibilities at the project level. Access may include the ability to access any information within a particular project. Permissions may include the ability to perform specific operations for a project. Access and permissions may override system-level permissions or provide more granular control. As an example of the former, a user who normally has full builder permissions may be restricted to partial builder permissions for a particular project. As an example of the latter, certain users may be restricted from loading new data into an existing project. Responsibilities may include action items that a user is expected to complete for the project.

Users acting as developers may access the builder areas of the interface to create and modify modeling methodologies, techniques, and tasks. As discussed previously, each builder may present one or more tools with different types of user interfaces that perform the corresponding logical operations. In some embodiments, the user interface 120 may permit developers to use a "Properties" sheet to edit the metadata attached to a technique. A technique may also have tuning parameters that correspond to variables for particular tasks. A developer may publish these tuning parameters to the technique-level Properties sheet, specifying default values and whether model builders may override those defaults.
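A published tuning parameter with a default value and an override flag might be represented as in the following sketch. The `TuningParameter` dataclass, the `resolve` function, and the example parameter names are assumptions made for illustration; the text does not specify a data model for the Properties sheet.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TuningParameter:
    """A parameter a developer publishes on the technique-level sheet."""
    name: str
    default: object
    overridable: bool = True  # may model builders change the default?

def resolve(parameters, overrides):
    """Merge a model builder's overrides with the published defaults."""
    known = {p.name for p in parameters}
    unknown = set(overrides) - known
    if unknown:
        raise KeyError(f"unpublished parameters: {sorted(unknown)}")
    resolved = {}
    for p in parameters:
        if p.name in overrides:
            if not p.overridable:
                raise PermissionError(f"{p.name} is locked by the developer")
            resolved[p.name] = overrides[p.name]
        else:
            resolved[p.name] = p.default
    return resolved

# Hypothetical parameters for a tree-based technique.
published = [
    TuningParameter("max_depth", 6),
    TuningParameter("random_seed", 42, overridable=False),
]
settings = resolve(published, {"max_depth": 10})
```

Rejecting unpublished names keeps model builders within the set of parameters the developer chose to expose.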

In some embodiments, the user interface 120 may provide a graphical flow-diagramming tool for specifying a hierarchical directed graph of tasks, along with any built-in operations for conditional logic, filtering output, transforming output, partitioning output, combining inputs, iterating over sub-graphs, and so on. In some embodiments, the user interface 120 may provide facilities for creating wrappers around pre-existing software to implement leaf-level tasks, including properties that can be set for each task.

In some embodiments, the user interface 120 may provide advanced developers with built-in access to interactive development environments (IDEs) for implementing leaf-level tasks. Developers may alternatively code a component in an external environment and wrap the code as a leaf-level task, but it may be more convenient to have direct access to such environments. In these embodiments, the IDEs themselves may be wrapped in the interface and logically integrated into the task builder. From the user's perspective, an IDE may run within the same interface framework and on the same computational infrastructure as the task builder. This capability may enable advanced developers to iterate more quickly in developing and modifying techniques. Some embodiments may further provide code collaboration features that facilitate coordination among multiple developers simultaneously programming the same leaf-level tasks.

Model builders may leverage the techniques produced by developers to build predictive models for their specific data sets. Different model builders may have different levels of experience and thus may require different support from the user interface. For relatively new users, the user interface 120 may present a process that is as automatic as possible while still giving users the ability to explore options and thereby learn more about predictive modeling. For intermediate users, the user interface 120 may present information that makes it easy to rapidly assess how readily a particular problem can be solved, to compare how their existing predictive models stack up against what the predictive modeling system 100 can produce automatically, and to get an accelerated start on complicated projects that will eventually benefit from substantial hands-on tuning. For advanced users, the user interface 120 may facilitate the extraction of a few extra decimal places of accuracy for an existing predictive model, the rapid assessment of the applicability of new techniques to the problems they have worked on, and the development of techniques for a whole class of problems their organizations may face. By capturing the knowledge of advanced users, some embodiments facilitate the propagation of that knowledge throughout the rest of the organization.

To support this breadth of user requirements, some embodiments of the user interface 120 provide a sequence of interface tools that reflect the model-building process. Moreover, each tool may offer a spectrum of features from basic to advanced. The first step in the model-building process may involve loading and preparing a data set. As discussed previously, a user may upload a file or specify how to access data from an online system. In the context of modeling project groups or hierarchies, a user may also specify which parts of the parent data set are to be used for the current project and which parts are to be added.

For basic users, the predictive modeling system 100 may proceed immediately to building models after the data set is specified, pausing only if the user interface 120 flags troubling issues, including, without limitation, unparseable data, too few observations to expect good results, too many observations to execute in a reasonable amount of time, too many missing values, or variables whose distributions may lead to unusual results. For intermediate users, the user interface 120 may facilitate a deeper understanding of the data by presenting a table of data set characteristics and graphs of variable importance, variable effects, and effect hotspots. The user interface 120 may also facilitate the understanding and visualization of relationships between variables by providing visualization tools including, without limitation, correlation matrices, partial dependence plots, and/or the results of unsupervised machine-learning algorithms such as k-means and hierarchical clustering. In some embodiments, the user interface 120 permits advanced users to create entirely new data set features by specifying formulas that transform an existing feature or a combination of features.
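A few of the data set checks listed above can be sketched as a function that returns human-readable flags. This is a sketch under stated assumptions: the thresholds, the row-as-dictionary representation with `None` for missing values, and the use of a constant column as one example of a problematic distribution are all hypothetical choices.

```python
def flag_dataset_issues(rows, min_rows=20, max_missing_fraction=0.5):
    """Return a list of warnings for a data set given as a list of dicts."""
    issues = []
    if len(rows) < min_rows:
        issues.append("too few observations")
    columns = list(rows[0].keys()) if rows else []
    for col in columns:
        values = [row.get(col) for row in rows]
        missing = sum(v is None for v in values)
        if missing / len(rows) > max_missing_fraction:
            issues.append(f"too many missing values: {col}")
        elif len({v for v in values if v is not None}) <= 1:
            # A constant column carries no information for a model.
            issues.append(f"constant column: {col}")
    return issues

# Hypothetical three-row data set with one constant and one sparse column.
rows = [{"x": 1, "y": None}, {"x": 1, "y": None}, {"x": 1, "y": 3}]
flags = flag_dataset_issues(rows, min_rows=2)
```

An empty result would let model building start immediately, while a non-empty result corresponds to the pause described for basic users.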

Once the data set is loaded, users may specify the model-fit metric to be optimized. For basic users, the predictive modeling system 100 may choose the model-fit metric, and the user interface 120 may present an explanation of the choice. For intermediate users, the user interface 120 may present information to help users understand the trade-offs in choosing different metrics for a particular data set. For advanced users, the user interface 120 may permit the user to specify custom metrics by writing formulas (e.g., objective functions) based on the low-level performance data collected by the search engine 110, or even by uploading custom metric calculation code.

With the data set loaded and the model-fit metric selected, the user may launch the search engine. For basic users, the search engine 110 may use the default prioritization settings for modeling techniques, and the user interface 120 may provide high-level information about model performance, how far into the data set the execution has progressed, and the general consumption of computing resources. For intermediate users, the user interface 120 may permit the user to specify a subset of techniques to consider and to slightly adjust some of the initial priorities. In some embodiments, the user interface 120 provides more granular performance and progress data so that intermediate users can make in-flight adjustments as previously described. In some embodiments, the user interface 120 provides intermediate users with more insight into and control of computing resource consumption. In some embodiments, the user interface 120 may provide advanced users with significant (e.g., complete) control of the techniques considered and their priorities, all available performance data, and significant (e.g., complete) control of resource consumption. By either offering distinct interfaces to different levels of users or "collapsing" more advanced features for less advanced users by default, some embodiments of the user interface 120 can support users at their corresponding levels.

During and after the exploration of the search space, the user interface may present information about the performance of one or more modeling techniques. Some performance information may be displayed in a tabular format, while other performance information may be displayed in a graphical format. For example, information presented in tabular format may include, without limitation, comparisons of model performance by technique, the fraction of data evaluated, technique properties, or the current consumption of computing resources. Information presented in graphical format may include, without limitation, the directed graph of tasks in a modeling procedure, comparisons of model performance across different partitions of the data set, representations of model performance such as receiver operating characteristic curves and lift charts, predicted versus actual values, and the consumption of computing resources over time. The user interface 120 may include a modular user interface framework that allows for the easy inclusion of new types of performance information of either kind. Moreover, some embodiments may allow the display of some types of information for each data partition and/or for each technique.

As discussed above, some embodiments of the user interface 120 support collaboration of multiple users on multiple projects. Across projects, the user interface 120 may permit users to share data, modeling tasks, and modeling techniques. Within a project, the user interface 120 may permit users to share data, models, and results. In some embodiments, the user interface 120 may permit users to modify properties of the project and use resources allocated to the project. In some embodiments, the user interface 120 may permit multiple users to modify project data and add models to the project, then compare these contributions. In some embodiments, the user interface 120 may identify which user made a specific change to the project, when the change was made, and what project resources the user used.

Model deployment engine

The model deployment engine 140 provides tools for deploying predictive models in operational environments. In some embodiments, the model deployment engine 140 monitors the performance of deployed predictive models and updates the performance metadata associated with the modeling techniques that generated the deployed models, so that the performance data accurately reflects the performance of the deployed models.

Users may deploy a fitted predictive model when they believe the fitted model warrants field testing or is capable of adding value. In some embodiments, users and external systems may access a prediction module (e.g., in an interface services layer of the predictive modeling system 100), specify one or more predictive models to be used, and supply new observations. The prediction module may then return the predictions provided by those models. In some embodiments, administrators may control which users and external systems have access to this prediction module, and/or set usage restrictions such as the number of predictions allowed per unit time.

For each model, the exploration engine 110 may store a record of the modeling technique used to generate the model and the state of the model after fitting, including coefficient and hyper-parameter values. Because each technique is already machine-executable, these values may be sufficient for the execution engine to generate predictions on new observation data. In some embodiments, a model's prediction may be generated by applying the pre-processing and modeling steps described in the modeling technique to each instance of new input data. However, in some cases, it may be possible to increase the speed of future prediction calculations. For example, a fitted model may make several independent checks of a particular variable's value. Combining some or all of these checks and then simply referencing them when convenient may decrease the total amount of computation used to generate a prediction. Similarly, several component models of a blended model may perform the same data transformation. Some embodiments may therefore reduce computation time by identifying duplicative calculations, performing them only once, and referencing the results of the calculations in the component models that use them.
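
The idea of performing a shared data transformation only once for all component models of a blended model can be sketched as follows. This is a minimal illustration, not the system's implementation: the function `predict_blend`, the name-keyed `transforms` table, the `lookup` accessor, and the simple averaging of component predictions are all assumptions made for the example.

```python
from typing import Callable, Dict, List

Row = Dict[str, float]
Transform = Callable[[Row], float]
# Hypothetical convention: a component model reads transformed values by name.
Component = Callable[[Callable[[str], float]], float]


def predict_blend(row: Row,
                  transforms: Dict[str, Transform],
                  components: List[Component]) -> float:
    """Average the component predictions, running each transformation at most once."""
    cache: Dict[str, float] = {}

    def lookup(name: str) -> float:
        # Compute the transformation on first use; later uses reference the stored result.
        if name not in cache:
            cache[name] = transforms[name](row)
        return cache[name]

    predictions = [component(lookup) for component in components]
    return sum(predictions) / len(predictions)
```

Both components below request the same `scaled_income` value, but the transformation runs once per observation: `predict_blend({"income": 2000.0}, {"scaled_income": lambda r: r["income"] / 1000.0}, [lambda get: 2.0 * get("scaled_income"), lambda get: get("scaled_income") + 1.0])`.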

In some embodiments, the deployment engine 140 improves the performance of a predictive model by identifying opportunities for parallel processing, thereby decreasing the response time of each prediction when the underlying hardware can execute multiple instructions in parallel. Some modeling techniques may describe a series of steps sequentially, but in fact some of the steps may be logically independent. By examining the data flow among the steps, the deployment engine 140 may identify situations of logical independence and then restructure the execution of the predictive model so that independent steps are executed in parallel. Blended models may present a special class of parallelization, because the constituent predictive models may be executed in parallel once any common data transformations have completed.
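
The blended-model case can be sketched as follows: the common transformation runs once, and the constituent models then run concurrently. This is an illustrative sketch using Python's standard `concurrent.futures` module; the function name `predict_blend_parallel`, the averaging of predictions, and the use of a thread pool are assumptions for the example, not details taken from the system described here.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Sequence

Row = Dict[str, float]


def predict_blend_parallel(row: Row,
                           shared_transform: Callable[[Row], Row],
                           components: Sequence[Callable[[Row], float]],
                           max_workers: int = 4) -> float:
    """Run the common transformation once, then the constituent models in parallel."""
    features = shared_transform(row)  # common step, completed before any model starts
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map returns results in the order of `components`.
        predictions = list(pool.map(lambda model: model(features), components))
    return sum(predictions) / len(predictions)
```

In CPython, threads mainly help when the constituent models spend their time in code that releases the global interpreter lock (for example, native numerical libraries); a process pool is one alternative for pure-Python models.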

In some embodiments, the deployment engine 140 may cache the state of a predictive model in memory. With this approach, successive prediction requests for the same model may not incur the time needed to load the model state. Caching may work especially well when there are many requests for predictions on a relatively small number of observations, because in that case the loading time is potentially a large part of the total execution time.
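
One way to picture such a cache is a small least-recently-used store keyed by model identifier. The sketch below is an assumption-laden illustration: the class `ModelStateCache`, the `loader` callable standing in for a read from persistent storage, and the LRU eviction policy are all hypothetical choices, since the text does not specify how the cache is organized.

```python
from collections import OrderedDict
from typing import Any, Callable


class ModelStateCache:
    """Keeps recently used model states in memory (hypothetical LRU policy)."""

    def __init__(self, loader: Callable[[str], Any], capacity: int = 8):
        self._loader = loader          # stands in for loading state from storage
        self._capacity = capacity
        self._states: "OrderedDict[str, Any]" = OrderedDict()
        self.loads = 0                 # number of times the loader was actually called

    def get(self, model_id: str) -> Any:
        if model_id in self._states:
            # Cache hit: mark as most recently used and skip the load.
            self._states.move_to_end(model_id)
            return self._states[model_id]
        state = self._loader(model_id)
        self.loads += 1
        self._states[model_id] = state
        if len(self._states) > self._capacity:
            # Evict the least recently used model state.
            self._states.popitem(last=False)
        return state
```

Repeated requests against the same model then pay the loading cost only on the first call, which matters most when each request carries few observations.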

In some embodiments, the deployment engine 140 may offer at least two implementations of predictive models: service-based and code-based. For service-based prediction, calculations run within a distributed computing infrastructure, as described below. Final predictive models may be stored in the data services layer of the distributed computing infrastructure. When a user or external system requests a prediction, it may indicate which model is to be used and provide at least one new observation. A prediction module may then load the model from the data services layer or from the module's in-memory cache, verify that the submitted observations match the structure of the original data set, and compute the predicted value for each observation. In some implementations, the predictive models may execute on a dedicated pool of cloud workers, thereby facilitating the generation of predictions with low-variance response times.
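
The step of checking submitted observations against the structure of the original data set before computing predictions might look like the sketch below. The schema representation (feature name mapped to accepted Python types) and the function names `validate_observation` and `predict_batch` are assumptions made for illustration; the text does not describe how the structure is recorded or checked.

```python
from typing import Any, Callable, Dict, List, Tuple

Schema = Dict[str, Tuple[type, ...]]  # hypothetical: feature name -> accepted types


def validate_observation(observation: Dict[str, Any], schema: Schema) -> List[str]:
    """Return a list of structural problems; an empty list means the observation matches."""
    problems = []
    for name, accepted in schema.items():
        if name not in observation:
            problems.append(f"missing feature: {name}")
        elif not isinstance(observation[name], accepted):
            problems.append(f"wrong type for feature: {name}")
    for name in observation:
        if name not in schema:
            problems.append(f"unexpected feature: {name}")
    return problems


def predict_batch(model: Callable[[Dict[str, Any]], float],
                  schema: Schema,
                  observations: List[Dict[str, Any]]) -> List[float]:
    """Validate every observation first, then compute one prediction per observation."""
    for index, observation in enumerate(observations):
        problems = validate_observation(observation, schema)
        if problems:
            raise ValueError(f"observation {index}: " + "; ".join(problems))
    return [model(observation) for observation in observations]
```

Validating the whole batch before predicting means a malformed request is rejected without partial results, which is one possible design choice among several.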

Service-based prediction may occur either interactively or via API. For interactive predictions, the user may enter the values of features for each new observation or upload a file containing the data for one or more observations. The user may then receive the predictions directly through the user interface 120 or download them as a file. For API predictions, an external system may access the prediction module via local or remote API, submit one or more observations, and receive the corresponding calculated predictions in return.

Some implementations of the deployment engine 140 may allow an organization to create one or more scaled-down instances of the distributed computing infrastructure for the purpose of performing service-based prediction. In the interface layer of the distributed computing infrastructure, each such instance may use the parts of the monitoring and prediction modules that are accessible by external systems, without access to the user-related functions. The analytics services layer may not use the technique IDE module, and the remaining modules in this layer may be stripped down and optimized for servicing prediction requests. The data services layer may not use the user or model-building data management functions. Such standalone prediction instances may be deployed on a parallel pool of cloud resources, distributed to other physical locations, or even downloaded to one or more dedicated machines that act as "prediction appliances".

To create a dedicated prediction instance, a user may specify the target computing infrastructure, for example, whether it is a set of cloud instances or a set of dedicated hardware. The corresponding modules may then be provisioned and either installed on the target computing infrastructure or packaged for installation. The user may either configure the instance with an initial set of predictive models or create a "blank" instance. After initial installation, users may manage the available predictive models by installing new ones or updating existing ones from the main installation.

For code-based predictions, the deployment engine 140 may generate source code for calculating predictions based on a particular model, and the user may incorporate the source code into other software. When a model is based on techniques whose leaf-level tasks are all implemented in the same programming language as that requested by the user, the deployment engine 140 may produce the source code for the predictive model by collating the code for the leaf-level tasks. When the model incorporates code from different languages, or the language is different from the one desired by the user, the deployment engine 140 may use more sophisticated approaches.

One approach is to use a source-to-source compiler to translate the source code of the leaf-level tasks into the target language. Another approach is to generate a function stub in the target language that then calls linked-in object code in the original language or accesses an emulator running such object code. The former approach may involve the use of a cross-compiler to generate object code specifically for the user's target computing platform. The latter approach may involve the use of an emulator that will run on the user's target platform.

Another approach is to generate an abstract description of a particular model and then compile that description into the target language. To generate an abstract description, some embodiments of the deployment engine 140 may use meta-models for describing a large number of potential pre-processing, model-fitting, and post-processing steps. The deployment engine may then extract the particular operations for a complete model and encode them using the meta-model. In such embodiments, a compiler for the target programming language may be used to translate the meta-model into the target language. So if a user wants prediction code in a supported language, the compiler may produce it. For example, in a decision-tree model, the decisions in the tree may be abstracted into logical if/then/else statements that are directly implementable in a wide variety of programming languages. Similarly, a set of mathematical operations that are supported in common programming languages may be used to implement a linear regression model.
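
The decision-tree example can be made concrete with a small sketch that turns an abstract tree description into if/else source code. The nested-dictionary tree format (keys `feature`, `threshold`, `left`, `right`, `value`) and the functions `tree_to_source` and `compile_tree` are hypothetical stand-ins for a meta-model and its compiler; Python is used as the target language here only for convenience.

```python
from typing import Any, Dict, List

Node = Dict[str, Any]  # hypothetical abstract description of a fitted tree


def tree_to_source(node: Node, indent: str = "    ") -> List[str]:
    """Emit if/else source lines for one node of the abstract tree description."""
    if "value" in node:
        # Leaf: return the stored prediction.
        return [f"{indent}return {node['value']!r}"]
    lines = [f"{indent}if row[{node['feature']!r}] <= {node['threshold']!r}:"]
    lines += tree_to_source(node["left"], indent + "    ")
    lines.append(f"{indent}else:")
    lines += tree_to_source(node["right"], indent + "    ")
    return lines


def compile_tree(node: Node, name: str = "predict") -> str:
    """Produce the source text of a standalone prediction function."""
    return "\n".join([f"def {name}(row):"] + tree_to_source(node)) + "\n"
```

Because the emitted code contains only comparisons and returns, a generator for another target language would differ mainly in surface syntax, which is the point the paragraph above makes about if/then/else constructs.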

However, disclosing a predictive model's source code in any language may be undesirable in some cases (e.g., in cases where the predictive modeling technique or predictive model contains proprietary capabilities or information). Therefore, the deployment engine 140 may convert a predictive model into a set of rules that preserves the predictive capabilities of the model without disclosing its procedural details. One approach is to apply an algorithm that produces such rules from a set of hypothetical predictions that the predictive model would generate in response to hypothetical observations. Some such algorithms (e.g., RuleFit) may produce a set of if-then rules for making predictions. For these algorithms, the deployment engine 140 may then convert the resulting if-then rules into the target language instead of converting the original predictive model. An additional advantage of converting a predictive model to a set of if-then rules is that it is generally easier to convert a set of if-then rules into a target programming language than a predictive model with arbitrary control and data flows, because the basic model of conditional logic is more similar across programming languages.

Once a model starts making predictions on new observations, the deployment engine 140 may track these predictions, measure their accuracy, and use these results to improve the predictive modeling system 100. In the case of service-based predictions, because predictions occur within the same distributed computing environment as the rest of the system, each observation and prediction may be saved via the data services layer. By providing an identifier for each prediction, some embodiments may allow a user or external software system to submit the actual values, if and when they are recorded. In the case of code-based predictions, some embodiments may include code that saves observations and predictions in a local system or back to an instance of the data services layer. Again, providing an identifier for each prediction may facilitate the collection of model performance data against the actual target values when they become available.
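
The role of the per-prediction identifier can be illustrated with a small in-memory log that accepts actual values later and scores only the predictions that have them. The class `PredictionLog`, its method names, the use of random UUIDs as identifiers, and mean absolute error as the accuracy measure are all assumptions chosen for the sketch.

```python
import uuid
from typing import Any, Dict, Optional


class PredictionLog:
    """Stores each observation and prediction under an identifier (hypothetical design)."""

    def __init__(self) -> None:
        self._records: Dict[str, Dict[str, Any]] = {}

    def record(self, observation: Dict[str, Any], prediction: float) -> str:
        """Save a prediction and return the identifier handed back to the caller."""
        prediction_id = str(uuid.uuid4())
        self._records[prediction_id] = {
            "observation": observation,
            "prediction": prediction,
            "actual": None,  # filled in if and when the true value is reported
        }
        return prediction_id

    def submit_actual(self, prediction_id: str, actual: float) -> None:
        """Attach the actual target value to an earlier prediction."""
        self._records[prediction_id]["actual"] = actual

    def mean_absolute_error(self) -> Optional[float]:
        """Score only predictions whose actual values are known; None if there are none."""
        scored = [r for r in self._records.values() if r["actual"] is not None]
        if not scored:
            return None
        return sum(abs(r["prediction"] - r["actual"]) for r in scored) / len(scored)
```

A caller keeps the identifier returned by `record` and later passes it to `submit_actual`, so accuracy can be measured as outcomes arrive rather than at prediction time.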

Information collected directly by the deployment engine 140 about the accuracy of predictions, and/or observations obtained through other channels, may be used to improve the model for a prediction problem (e.g., to "refresh" an existing model, or to generate a model by re-exploring the modeling search space in part or in full). New data can be added to improve a model in the same way that data was originally added to create the model, or by submitting target values for data previously used in prediction.

Some models may be refreshed (e.g., refitted) by applying the corresponding modeling techniques to the new data and combining the resulting new model with the existing model, while others may be refreshed by applying the corresponding modeling techniques to a combination of the original and new data. In some embodiments, when refreshing a model, only some of the model parameters may be recalculated (e.g., to refresh the model more quickly, or because the new data provides information that is particularly relevant to particular parameters).

Alternatively or in addition, new models may be generated by exploring the modeling search space, in part or in full, with the new data included in the data set. The re-exploration of the search space may be limited to a portion of the search space (e.g., limited to modeling techniques that performed well in the original search), or may cover the entire search space. In either case, the initial suitability scores for the modeling technique(s) that generated the deployed model(s) may be recalculated to reflect the performance of the deployed model(s) on the prediction problem. Users may choose to exclude some of the previous data when performing the recalculations. Some embodiments of the deployment engine 140 may track different versions of the same logical model, including which subsets of data were used to train which versions.

In some embodiments, this prediction data may be used to perform post-request analysis of trends in the input parameters or in the predictions themselves over time, and to alert the user to potential issues with the inputs or with the quality of the model predictions. For example, if an aggregate measure of model performance starts to degrade over time, the system may alert the user to consider refreshing the model or investigating whether the inputs themselves are shifting. Such shifts may be caused by temporal change in a particular variable or by drift in the entire population. In some embodiments, most of this analysis is performed after prediction requests are completed, to avoid slowing down the prediction responses. However, the system may perform some validation at prediction time to avoid particularly bad predictions (e.g., in cases where an input value is outside the range of values that the system has computed as valid given characteristics of the original training data, the modeling technique, and the final model fitting state).
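
The two checks described above, a cheap range validation at prediction time and a degradation check performed afterward, can be sketched as follows. Treating the valid range as the minimum and maximum seen in training is a simplifying assumption (the text says only that the range is computed from characteristics of the training data, technique, and fitted model), and the function names and the `tolerance` multiplier are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Row = Dict[str, float]


def fit_valid_ranges(training_rows: Sequence[Row]) -> Dict[str, Tuple[float, float]]:
    """Record the (min, max) of each feature seen in training (simplifying assumption)."""
    ranges: Dict[str, Tuple[float, float]] = {}
    for row in training_rows:
        for name, value in row.items():
            low, high = ranges.get(name, (value, value))
            ranges[name] = (min(low, value), max(high, value))
    return ranges


def out_of_range_features(row: Row, ranges: Dict[str, Tuple[float, float]]) -> List[str]:
    """Prediction-time check: list features whose values fall outside the valid range."""
    return [name for name, (low, high) in ranges.items()
            if not low <= row[name] <= high]


def performance_degraded(baseline_error: float,
                         recent_errors: Sequence[float],
                         tolerance: float = 1.25) -> bool:
    """Post-request check: has the recent mean error grown past the tolerated multiple?"""
    recent_mean = sum(recent_errors) / len(recent_errors)
    return recent_mean > tolerance * baseline_error
```

The range check is cheap enough to run inline with a prediction request, while the degradation check can run on stored predictions once actual values are available.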

After-the-fact analysis may be important in cases where a user has deployed a model to make extrapolations well beyond the population used in training. For example, a model may have been trained on data from one geographic region but used to make predictions for a population in a completely different geographic region. Sometimes, such extrapolation to new populations may result in model performance that is substantially worse than expected. In these cases, the deployment engine 140 may alert the user and/or automatically refresh the model by re-fitting one or more modeling techniques, using the new values to extend the original training data.

Advantages of some embodiments

The predictive modeling system 100 may significantly improve the productivity of analysts at any skill level and/or significantly increase the accuracy of predictive models achievable with a given amount of resources. Automating procedures can reduce workload, and systematizing processes can enforce consistency, enabling analysts to spend more time generating unique insights. Three common scenarios illustrate these advantages: forecasting outcomes, predicting properties, and inferring measurements.

Forecasting outcomes

If an organization can accurately forecast outcomes, then it can both plan more effectively and improve its behavior. Therefore, a common application of machine learning is to develop algorithms that produce forecasts. For example, many industries face the problem of predicting costs in large-scale, time-consuming projects.

In some embodiments, the techniques described herein can be used for forecasting cost overruns (e.g., software cost overruns or construction cost overruns). For example, the techniques described herein may be applied to the problem of forecasting cost overruns as follows:

1. Select a model-fitting metric appropriate to the response variable type (e.g., numerical or binary, approximately Gaussian or strongly non-Gaussian): The predictive modeling system 100 may recommend a metric based on data characteristics, requiring less skill and effort from the user, while still allowing the user to make the final selection.

2. Pre-process the data to address outliers and missing data values: The predictive modeling system 100 may provide a detailed summary of data characteristics, enabling users to develop better situational awareness of the modeling problem and to assess potential modeling challenges more effectively. The predictive modeling system 100 may include automated procedures for outlier detection and replacement, missing-value imputation, and the detection and treatment of other data anomalies, requiring less skill and effort from the user. The predictive modeling system's procedures for addressing these challenges may be systematic, leading to more consistent modeling results across methods, data sets, and time than ad hoc data editing procedures.

3. Partition the data for modeling and evaluation: The predictive modeling system 100 may automatically partition the data into training, validation, and holdout sets. This partitioning may be more flexible than the train-and-test partitioning used by some data analysts, and may be consistent with widely accepted recommendations from the machine learning community. The use of a consistent partitioning approach across methods, data sets, and time can make results more readily comparable, enabling more effective allocation of deployment resources in commercial contexts.

4. Select model structures, generate derived features, select model tuning parameters, fit models, and evaluate: In some embodiments, the predictive modeling system 100 can fit many different model types, including, but not limited to, decision trees, neural networks, support vector machine models, regression models, boosted trees, random forests, deep learning neural networks, etc. The predictive modeling system 100 may provide the option of automatically constructing ensembles from those component models that exhibit the best individual performance. Exploring a larger space of potential models can improve accuracy. The predictive modeling system may automatically generate a variety of derived features appropriate to different data types (e.g., Box-Cox transformations, text pre-processing, principal components, etc.). Exploring a larger space of potential transformations can improve accuracy. The predictive modeling system 100 may use cross-validation to select the best values for these tuning parameters as part of the model-building process, thereby improving the choice of tuning parameters and creating an audit trail of how the selection of parameters affects the results. The predictive modeling system 100 may fit and evaluate the different model structures considered as part of this automated process, ranking the results in terms of validation set performance.

5. Select the final model: The choice of the final model can be made by the predictive modeling system 100 or by the user. In the latter case, the predictive modeling system may provide support to help the user make this decision, including, for example, the ranked validation set performance assessments for the models, the option of comparing and ranking performance by quality measures other than the one used in the fitting process, and/or the opportunity to build ensemble models from those component models that exhibit the best individual performance.
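
Steps 3 through 5 above (partitioning the data, evaluating candidate models, and ranking them by validation performance) can be sketched as follows. This is an illustrative sketch under stated assumptions: the 60/20/20 split, the seeded shuffle, and the function names `partition` and `rank_models` are hypothetical, and the models are assumed to be already fitted callables.

```python
import random
from typing import Any, Callable, Dict, List, Sequence, Tuple


def partition(rows: Sequence[Any],
              validation_fraction: float = 0.2,
              holdout_fraction: float = 0.2,
              seed: int = 0) -> Tuple[List[Any], List[Any], List[Any]]:
    """Shuffle reproducibly, then split into training, validation, and holdout sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_holdout = int(len(shuffled) * holdout_fraction)
    n_validation = int(len(shuffled) * validation_fraction)
    holdout = shuffled[:n_holdout]
    validation = shuffled[n_holdout:n_holdout + n_validation]
    training = shuffled[n_holdout + n_validation:]
    return training, validation, holdout


def rank_models(models: Dict[str, Callable[[Any], float]],
                validation: Sequence[Any],
                metric: Callable[[Callable[[Any], float], Sequence[Any]], float]
                ) -> List[Tuple[str, float]]:
    """Score every model on the validation set and sort by error, lowest first."""
    scores = {name: metric(model, validation) for name, model in models.items()}
    return sorted(scores.items(), key=lambda item: item[1])
```

Using the same seed and fractions for every candidate keeps the comparison consistent across methods, and the holdout set stays untouched until a final model has been chosen.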

One important practical aspect of the predictive modeling system's model development process is that, once the initial data set has been assembled, all subsequent computations may occur within the same software environment. This aspect represents an important difference from conventional model-building efforts, which often involve a combination of different software environments. An important practical disadvantage of such multi-platform analysis approaches is the need to convert results into common data formats that can be shared between the different software environments. Often this conversion is done either manually or with custom "one-off" reformatting scripts. Errors in this process can lead to extremely serious data distortions. The predictive modeling system 100 may avoid such reformatting and data transfer errors by performing all computations in one software environment. More generally, because it is highly automated, fitting and optimizing many different model structures, the predictive modeling system 100 can provide a substantially faster and more systematic, and thus more readily explainable and more repeatable, route to the final model. Moreover, because the predictive modeling system 100 explores more different modeling methods and includes more possible predictors, the resulting models may be more accurate than those obtained by traditional methods.

Predicting properties

In many fields, organizations face uncertainty in the outcome of a production process and want to predict how a given set of conditions will affect the final properties of the output. Therefore, a common application of machine learning is to develop algorithms that predict these properties. For example, concrete is a common building material whose final structural properties can vary dramatically from one situation to another. Because concrete properties vary significantly over time and depend on a highly variable composition, neither models developed from first principles nor traditional regression models offer adequate predictive accuracy.

In some embodiments, the techniques described herein can be used to predict properties of the outcome of a production process (e.g., properties of concrete). For example, the techniques described herein may be applied to the problem of predicting properties of concrete as follows:

1. Partition the data set into training, validation, and test subsets.

2. Clean the modeling data set: The predictive modeling system 100 may automatically check for missing data, outliers, and other important data anomalies, recommend treatment strategies, and offer the user the option to accept or decline them. This approach may require less skill and effort from the user, and/or may provide more consistent results across methods, data sets, and time.

3. Select the response variable and choose a primary fitting metric: The user may select the response variable to be predicted from those available in the modeling data set. Once the response variable has been chosen, the predictive modeling system 100 may recommend a compatible fitting metric, which the user may accept or override. This approach may require less skill and effort from the user. Based on the response variable type and the fitting metric selected, the predictive modeling system may offer a set of predictive models, including traditional regression models, neural networks, and other machine learning models (e.g., random forests, boosted trees, support vector machines). By automatically searching the space of possible modeling approaches, the predictive modeling system 100 may increase the expected accuracy of the final model. The default set of model choices may be overridden to exclude certain model types from consideration, to add other model types supported by the predictive modeling system but not part of the default list, or to add the user's own custom model types (e.g., implemented in R or Python).

4. Generate input features, fit models, optimize model-specific tuning parameters, and evaluate performance: In some embodiments, feature generation may include scaling for numerical covariates, Box-Cox transformations, principal components, etc. Tuning parameters for the models may be optimized via cross-validation. Validation set performance measures may be computed and presented for each model, along with other summary characteristics (e.g., model parameters for regression models, variable importance measures for boosted trees or random forests).

5. Select the final model: The choice of the final model can be made by the predictive modeling system 100 or by the user. In the latter case, the predictive modeling system may provide support to help the user make this decision, including, for example, ranked validation set performance assessments for the models, the option of comparing and ranking performance by quality measures other than the one used in the fitting process, and/or the opportunity to build ensemble models from those component models that exhibit the best individual performance.
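The partitioning, fitting, and ranking steps above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the described system: the function names, the 60/20/20 split, the two candidate models (a mean baseline and a one-variable least-squares fit), the RMSE metric, and the synthetic data standing in for a real modeling data set.

```python
import math
import random

def split_dataset(rows, train=0.6, valid=0.2, seed=0):
    """Shuffle rows and split them into training, validation, and test subsets."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_train = int(len(rows) * train)
    n_valid = int(len(rows) * valid)
    return rows[:n_train], rows[n_train:n_train + n_valid], rows[n_train + n_valid:]

def fit_mean(train):
    """Baseline candidate: always predict the mean response of the training set."""
    mean_y = sum(y for _, y in train) / len(train)
    return lambda x: mean_y

def fit_linear(train):
    """Candidate: ordinary least squares on a single predictor (closed form)."""
    n = len(train)
    mx = sum(x for x, _ in train) / n
    my = sum(y for _, y in train) / n
    sxx = sum((x - mx) ** 2 for x, _ in train)
    sxy = sum((x - mx) * (y - my) for x, y in train)
    slope = sxy / sxx if sxx else 0.0
    return lambda x: my + slope * (x - mx)

def rmse(model, rows):
    """Root mean squared error of a fitted model on (x, y) rows."""
    return math.sqrt(sum((model(x) - y) ** 2 for x, y in rows) / len(rows))

def rank_models(candidates, train, valid):
    """Fit every candidate on train and rank by validation RMSE (lower is better)."""
    fitted = {name: fit(train) for name, fit in candidates.items()}
    scores = {name: rmse(model, valid) for name, model in fitted.items()}
    return sorted(scores, key=scores.get), scores, fitted

# Synthetic (predictor, response) pairs; a stand-in, not real production data.
rng = random.Random(1)
data = [(x, 2.0 * x + 5.0 + rng.gauss(0, 1.0))
        for x in (rng.uniform(0, 10) for _ in range(200))]

train, valid, test = split_dataset(data)
ranking, scores, fitted = rank_models({"mean": fit_mean, "linear": fit_linear},
                                      train, valid)
best = fitted[ranking[0]]
print(ranking, round(rmse(best, test), 3))
```

The test subset is touched only once, after the ranking has been decided on the validation subset, so the reported test error is not biased by the model selection step.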

Inference of Measurements

Some measurements are much more costly to make than others, so organizations may want to substitute cheaper metrics for more expensive ones. A common application of machine learning is therefore to infer the likely output of an expensive measurement from the known output of cheaper ones. For example, "curl" is a property that captures how paper products tend to depart from a flat shape, but it can typically be judged only after the product is completed. Being able to infer the curl of paper from mechanical properties that are easily measured during manufacturing can thus result in an enormous cost savings in achieving a given level of quality. For typical end-use properties, the relationship between these properties and manufacturing process conditions is not well understood.

In some embodiments, the techniques described herein can be used to infer measurements. For example, the techniques described herein may be applied to the problem of inferring measurements as follows:

1. Characterize the modeling data set: The predictive modeling system 100 may provide key summary characteristics and offer recommendations for the treatment of important data anomalies, which the user is free to accept, decline, or request more information about. For example, key characteristics of variables may be computed and displayed; the prevalence of missing data may be displayed and a treatment strategy recommended; outliers in numerical variables may be detected and, if found, a treatment strategy recommended; and/or other data anomalies (e.g., inliers, non-informative variables whose values never change) may be detected automatically and recommended treatments made available to the user.

2. Partition the data set into training/validation/holdout subsets.

3. Feature generation/model structure selection/model fitting: The predictive modeling system 100 may combine and automate these steps, allowing extensive internal iteration. Multiple features may be automatically generated and evaluated, using both classical techniques like principal components and newer methods like boosted trees. Many different model types may be fitted and compared, including regression models, neural networks, support vector machines, random forests, boosted trees, and others. In addition, the user may have the option of including other model structures that are not part of this default collection. Model sub-structure selection (e.g., selection of the number of hidden units in neural networks, the specification of other model-specific tuning parameters, etc.) may be automatically performed by extensive cross-validation as part of this model fitting and evaluation process.

4. Select the final model: The choice of the final model can be made by the predictive modeling system 100 or by the user. In the latter case, the predictive modeling system may provide support to help the user make this decision, including, for example, ranked validation set performance assessments for the models, the option of comparing and ranking performance by quality measures other than the one used in the fitting process, and/or the opportunity to build ensemble models from those component models that exhibit the best individual performance.
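The cross-validated selection of a model-specific tuning parameter mentioned in step 3 can be sketched as follows. This is an illustrative sketch only: the k-nearest-neighbor regressor, the interleaved five-fold scheme, the candidate grid, and the synthetic data are assumptions chosen to keep the example short, and they do not describe the system's actual search procedure.

```python
import random

def knn_predict(train, x, k):
    """Predict y at x as the mean response of the k nearest training points."""
    nearest = sorted(train, key=lambda row: abs(row[0] - x))[:k]
    return sum(y for _, y in nearest) / len(nearest)

def k_fold_indices(n, folds):
    """Yield (train_idx, valid_idx) pairs using interleaved folds over range(n)."""
    idx = list(range(n))
    for f in range(folds):
        valid = idx[f::folds]
        valid_set = set(valid)
        train = [i for i in idx if i not in valid_set]
        yield train, valid

def cv_mse(rows, k, folds=5):
    """Mean squared error of k-NN regression, averaged over all held-out points."""
    total, count = 0.0, 0
    for train_idx, valid_idx in k_fold_indices(len(rows), folds):
        train = [rows[i] for i in train_idx]
        for i in valid_idx:
            x, y = rows[i]
            total += (knn_predict(train, x, k) - y) ** 2
            count += 1
    return total / count

def tune_k(rows, grid, folds=5):
    """Return the grid value with the lowest cross-validated error, plus all scores."""
    scores = {k: cv_mse(rows, k, folds) for k in grid}
    return min(scores, key=scores.get), scores

# Synthetic nonlinear data; a stand-in for cheap measurements and a costly target.
rng = random.Random(0)
rows = [(x, x * x + rng.gauss(0, 1.0))
        for x in (rng.uniform(-3, 3) for _ in range(150))]

best_k, scores = tune_k(rows, grid=[1, 5, 25, 120])
print(best_k)
```

With 120 training points per fold, the grid value 120 reduces the model to a global mean, which gives the search an obviously poor setting to reject.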

In some embodiments, because the predictive modeling system 100 automates and efficiently implements data preprocessing (e.g., anomaly detection), data partitioning, multiple feature generation, model fitting, and model evaluation, the time required to develop models may be much shorter than in the traditional development cycle. Further, in some embodiments, because the predictive modeling system automatically includes data pretreatment procedures to handle both well-known data anomalies, such as missing data and outliers, and less widely appreciated anomalies, such as inliers (repeated observations that are consistent with the data distribution but erroneous) and postdictors (i.e., extremely predictive covariates that arise from information leakage), the resulting models may be more accurate and more useful. In some embodiments, the predictive modeling system 100 can explore a much wider range of model types, and many more specific models of each type, than is traditionally feasible. This model variety may greatly reduce the likelihood of unsatisfactory results, even when applied to a data set of compromised quality.
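The anomaly types named above can be screened for with simple heuristics, as in the following sketch. The rules used here are assumptions for illustration rather than the system's own procedures: `None` marks a missing value, outliers are flagged with Tukey's 1.5 x IQR rule, an inlier candidate is a single in-range value that accounts for at least 20% of the observations, and a postdictor candidate is a covariate whose absolute Pearson correlation with the response is at least 0.99. A flag from any of these checks is only a prompt for review, not proof of an error.

```python
import statistics
from collections import Counter

def screen_column(values, repeat_share=0.2):
    """Return simple anomaly flags for one numeric column (None = missing).

    Assumes at least two non-missing values. Thresholds are illustrative.
    """
    present = [v for v in values if v is not None]
    report = {"missing": len(values) - len(present)}
    report["constant"] = len(set(present)) <= 1
    # Outliers: Tukey's rule, more than 1.5 * IQR beyond the quartiles.
    q1, _, q3 = statistics.quantiles(present, n=4)
    spread = q3 - q1
    low, high = q1 - 1.5 * spread, q3 + 1.5 * spread
    report["outliers"] = [v for v in present if v < low or v > high]
    # Inlier candidate: one in-range value repeated suspiciously often.
    value, count = Counter(present).most_common(1)[0]
    suspicious = (not report["constant"]
                  and count / len(present) >= repeat_share
                  and low <= value <= high)
    report["inlier_candidate"] = value if suspicious else None
    return report

def pearson(xs, ys):
    """Pearson correlation coefficient; 0.0 if either series is constant."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5 if sxx and syy else 0.0

def postdictor_candidates(columns, response, threshold=0.99):
    """Names of covariates that track the response almost perfectly."""
    return [name for name, xs in columns.items()
            if abs(pearson(xs, response)) >= threshold]

# Hypothetical sensor column: one gap, one spike, and a repeated fill value.
readings = [20.1, 19.8, None, 20.5, 20.0, 20.0, 20.0, 21.0, 20.3, 95.0, 19.9, 20.0]
report = screen_column(readings)
print(report)

response = [1, 2, 3, 4, 5, 6]
columns = {"leaky": [2.0, 4.0, 6.1, 8.0, 10.0, 12.0], "noisy": [3, 1, 4, 1, 5, 9]}
print(postdictor_candidates(columns, response))
```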

Implementation of a Predictive Modeling System

Referring to FIG. 5, in some embodiments, a predictive modeling system 500 (e.g., an embodiment of predictive modeling system 100) includes at least one client computer 510, at least one server 550, and one or more processing nodes 570. The illustrated configuration is for exemplary purposes only, and it is intended that there can be any number of clients 510 and/or servers 550.

In some embodiments, predictive modeling system 500 may perform one or more (e.g., all) steps of method 300. In some embodiments, client 510 may implement user interface 120, and the predictive modeling module 552 of server 550 may implement other components of predictive modeling system 100 (e.g., modeling space exploration engine 110, library of modeling techniques 130, a library of prediction problems, and/or modeling deployment engine 140). In some embodiments, the computational resources allocated by exploration engine 110 for the exploration of the modeling search space may be resources of the one or more processing nodes 570, and the one or more processing nodes 570 may execute the modeling techniques according to the resource allocation schedule. However, embodiments are not limited by the manner in which the components of predictive modeling system 100 or predictive modeling method 300 are distributed between client 510, server 550, and the one or more processing nodes 570. Furthermore, in some embodiments, all components of predictive modeling system 100 may be implemented on a single computer (instead of being distributed between client 510, server 550, and processing node(s) 570), or implemented on two computers (e.g., client 510 and server 550).

One or more communications networks 530 connect the client 510 with the server 550, and one or more communications networks 580 connect the server 550 with the processing node(s) 570. The communication may take place via any media, such as standard telephone lines, LAN or WAN links (e.g., T1, T3, 56kb, X.25), broadband connections (ISDN, Frame Relay, ATM), and/or wireless links (IEEE 802.11, Bluetooth). Preferably, the networks 530/580 can carry TCP/IP protocol communications, and data (e.g., HTTP/HTTPS requests, etc.) transmitted by client 510, server 550, and processing node(s) 570 can be communicated over such TCP/IP networks. The type of network is not a limitation, however, and any suitable network may be used. Non-limiting examples of networks that can serve as or be part of the communications networks 530/580 include a wireless or wired Ethernet-based intranet, a local or wide-area network (LAN or WAN), and/or the global communications network known as the Internet, which may accommodate many different communications media and protocols.

The client 510 is preferably implemented as software 512 running on hardware. In some embodiments, the hardware may include a personal computer (e.g., a PC with an INTEL processor or an APPLE MACINTOSH) capable of running such operating systems as the MICROSOFT WINDOWS family of operating systems from Microsoft Corporation of Redmond, Washington, the MACINTOSH operating system from Apple Computer of Cupertino, California, and/or various varieties of Unix, such as SUN SOLARIS from SUN MICROSYSTEMS and GNU/Linux from RED HAT, INC. of Durham, North Carolina. The client 510 may also be implemented on such hardware as a smart or dumb terminal, network computer, wireless device, wireless telephone, information appliance, workstation, minicomputer, mainframe computer, personal data assistant, tablet, smart phone, or other computing device that is operated as a general purpose computer, or a special purpose hardware device used solely for serving as a client 510.

Generally, in some embodiments, clients 510 can be operated and used for various activities, including sending and receiving electronic mail and/or instant messages, requesting and viewing content available over the World Wide Web, participating in chat rooms, or performing other tasks commonly done using a computer, handheld device, or cellular telephone. Clients 510 can also be operated by users on behalf of others, such as employers, who provide the clients 510 to the users as part of their employment.

In various embodiments, the software 512 of client computer 510 includes client software 514 and/or a web browser 516. The web browser 516 allows the client 510 to request a web page or other downloadable program, applet, or document (e.g., from the server 550) with a web-page request. One example of a web page is a data file that includes computer-executable or interpretable information, graphics, sound, text, and/or video, that can be displayed, executed, played, processed, streamed, and/or stored, and that can contain links, or pointers, to other web pages. Examples of commercially available web browser software 516 are INTERNET EXPLORER, offered by Microsoft Corporation, NETSCAPE NAVIGATOR, offered by AOL/Time Warner, FIREFOX, offered by the Mozilla Foundation, or CHROME, offered by Google.

In some embodiments, the software 512 includes client software 514. The client software 514 provides functionality to the client 510 that allows a user, for example, to send and receive electronic mail, instant messages, telephone calls, video messages, streaming audio or video, or other content. Examples of client software 514 include, but are not limited to, OUTLOOK and OUTLOOK EXPRESS, offered by Microsoft Corporation, THUNDERBIRD, offered by the Mozilla Foundation, and INSTANT MESSENGER, offered by AOL/Time Warner. Not shown are standard components associated with client computers, including a central processing unit, volatile and non-volatile storage, input/output devices, and a display.

In some embodiments, web browser software 516 and/or client software 514 may allow the client to access a user interface 120 for the predictive modeling system 100.

The server 550 interacts with the clients 510. The server 550 is preferably implemented on one or more server-class computers that have sufficient memory, data storage, and processing power and that run a server-class operating system (e.g., SUN Solaris, GNU/Linux, and the MICROSOFT WINDOWS family of operating systems). System hardware and software other than that specifically described herein may also be used, depending on the capacity of the device and the size of the user base. For example, the server 550 may be or may be part of a logical group of one or more servers, such as a server farm or server network. As another example, there may be multiple servers 550 associated with or connected to each other, or multiple servers may operate independently but with shared data. In a further embodiment, and as is typical in large-scale systems, the application software can be implemented in components, with different components running on different server computers, on the same server, or some combination.

In some embodiments, server 550 includes a predictive modeling module 552, a communications module 556, and/or a data storage module 554. In some embodiments, the predictive modeling module 552 may implement modeling space exploration engine 110, library of modeling techniques 130, a library of prediction problems, and/or modeling deployment engine 140. In some embodiments, server 550 may use communications module 556 to communicate the outputs of the predictive modeling module 552 to the client 510, and/or to oversee execution of modeling techniques on processing node(s) 570. The modules described throughout the specification can be implemented in whole or in part as a software program using any suitable programming language or languages (C++, C#, Java, LISP, BASIC, PERL, etc.) and/or as a hardware device (e.g., ASIC, FPGA, processor, memory, storage, and the like).

A data storage module 554 may store, for example, predictive modeling library 130 and/or a library of prediction problems. The data storage module 554 may be implemented using, for example, the MySQL Database Server by MySQL AB of Uppsala, Sweden, the PostgreSQL Database Server by the PostgreSQL Global Development Group of Berkeley, California, or the ORACLE Database Server offered by ORACLE Corp. of Redwood Shores, California.

FIGS. 6 to 8 illustrate one possible implementation of predictive modeling system 100. The discussion of FIGS. 6 to 8 is given by way of example of some embodiments and is in no way limiting.

To execute the previously described procedures, predictive modeling system 100 may use a distributed software architecture 600 running on a variety of client and server computers. The goal of the software architecture 600 is to simultaneously deliver a rich user experience and computationally intensive processing. The software architecture 600 may implement a variation of the basic 4-tier Internet architecture. As illustrated in FIG. 6, it extends this foundation to leverage cloud-based computation, coordinated via the application and data tiers.

The similarities and differences between architecture 600 and the basic 4-tier Internet architecture may include:

(1) Clients 610. The architecture 600 makes essentially the same assumptions about clients 610 as any other Internet application. The primary use case involves frequent access for long periods of time to perform complex tasks. Thus, target platforms include rich Web clients running on a laptop or desktop. However, users may also access the architecture via mobile devices. Therefore, the architecture is designed to accommodate native clients 612 that directly access the Interface Services APIs using relatively thin client-side libraries. Of course, any cross-platform GUI layers, such as Java and Flash, could similarly access these APIs.

(2) Interface Services 620. This tier of the architecture is an extended version of the basic Internet presentation tier. Due to the sophisticated user interaction that may be used to direct machine learning, alternative implementations may support a wide variety of content via this tier, including static HTML, dynamic HTML, SVG visualizations, executable JavaScript code, and even self-contained IDEs. Moreover, as new Internet technologies evolve, implementations may need to accommodate new forms of content or alter the division of labor between the client, presentation, and application tiers for executing user interaction logic. Therefore, the Interface Services tier 620 may provide a flexible framework for integrating multiple content delivery mechanisms of varying richness, plus common supporting facilities such as authentication, access control, and input validation.

(3) Analytic Services 630. The architecture may be used to produce predictive analytics solutions, so its application tier focuses on delivering Analytic Services. The computational intensity of machine learning drives major enhancements to the standard application tier, such as the dynamic allocation of machine-learning tasks to large numbers of virtual "workers" running in cloud environments. For every type of logical computation request generated by the execution engine, the Analytic Services tier 630 coordinates with the other tiers to accept requests, break requests into tasks, assign tasks to workers, provide the data necessary for task execution, and collate the execution results. There is also an associated difference from a standard application tier. The predictive modeling system 100 may allow users to develop their own machine-learning techniques, and thus some implementations may provide one or more full IDEs, with their capabilities partitioned across the Client, Interface Services, and Analytic Services tiers. The execution engine then incorporates new and improved techniques created via these IDEs into future machine-learning computations.

(4) Worker Clouds 640. To efficiently perform modeling computations, the predictive modeling system 100 may break them into smaller tasks, which it allocates to virtual worker instances running in cloud environments. The architecture 600 allows for different types of workers and different types of clouds. Each worker type corresponds to a specific virtual machine configuration. For example, the default worker type provides general machine-learning capabilities for trusted modeling code. Another type, however, enforces additional security "sandboxing" for user-developed code. Alternative types might offer configurations optimized for specific machine-learning techniques. As long as the Analytic Services tier 630 understands the purpose of each worker type, it can allocate tasks appropriately. Similarly, the Analytic Services tier 630 can manage workers in different types of clouds. An organization might maintain a pool of instances in its private cloud as well as have the option to run instances in a public cloud. It might even have different pools of instances running on different kinds of commercial cloud services or even a proprietary internal one. As long as the Analytic Services tier 630 understands the tradeoffs in capabilities and costs, it can allocate tasks appropriately.

(5) Data Services 650. The architecture 600 assumes that the various services running in the various tiers may benefit from a corresponding variety of storage options. Therefore, it provides a framework for delivering a rich array of Data Services 650, e.g., file storage for any type of permanent data, temporary databases for purposes such as caching, and permanent databases for long-term record management. Such services may even be specialized for particular types of content, such as the virtual machine image files used for cloud workers and IDE servers. In some cases, implementations of the Data Services tier 650 may enforce particular access idioms on specific types of data so that the other tiers can smoothly coordinate. For instance, standardizing the format for data sets and model results means that the Analytic Services tier 630 may simply pass a reference to a user's data set when it assigns a task to a worker. The worker can then access this data set from the Data Services tier 650 and return a reference to the model results, which it has in turn stored via Data Services 650.

(6) External systems (660). As with any other Internet application, the use of APIs allows external systems to integrate with the predictive modeling system 100 at any layer of the architecture 600. For example, a business dashboard application may access graphic visualizations and modeling results through the interface services layer 620. An external data warehouse, or even a live business application, may provide modeling data sets to the analytics services layer 630 through a data integration platform. A reporting application may access all the modeling results from a particular time period through the data services layer 650. In most situations, however, external systems will not have direct access to the worker cloud 640; they will use it through the analytics services layer 630.

As with all multi-layer architectures, the layers of architecture 600 are logical. Physically, services from different layers can run on the same machine, different modules of the same layer can run on separate machines, and multiple instances of the same module can run across several machines. Likewise, the services of one layer may run across multiple network segments, and services from different layers may or may not run on different network segments. The logical structure nevertheless helps coordinate developers' and operators' expectations of how different modules will interact, and it gives operators the flexibility they need to balance service-level requirements such as scalability, reliability, and security.

Although the high-level layers appear quite similar to those of a typical Internet application, the addition of cloud-based computation can substantially change how information flows through the system.

Internet applications typically offer two distinct types of user interaction: synchronous and asynchronous. For conceptually synchronous operations, such as finding and booking a flight, the user makes a request and waits for the response before making the next request. For conceptually asynchronous operations, such as setting an alert for online transactions that meet certain criteria, the user makes a request and expects the system to notify the user of the result at some later time. (Typically, the system gives the user an initial request "ticket" and delivers the notification through a designated communication channel.)

In contrast, building and refining machine-learning models may involve an interaction pattern that falls somewhere in between. Setting up a modeling problem may involve an initial series of conceptually synchronous steps. However, when the user instructs the system to begin computing alternative solutions, a user who understands the scale of the corresponding computations is unlikely to expect an immediate response. Superficially, this expectation of delayed results makes this phase of the interaction appear asynchronous.

However, the predictive modeling system 100 does not force the user to "fire and forget," that is, to suspend the user's own engagement with the problem until a notification is received. In fact, it may encourage the user to continue exploring the data set and to review preliminary results as soon as they arrive. Such additional exploration or early insight may prompt the user to change the model-building parameters "in flight." The system can then process the requested changes and reallocate the processing tasks. The predictive modeling system 100 may allow this request-and-revise dynamic to continue throughout the user's session.

Thus, the analytics services and data services layers may mediate between the request-response loop from the user on the one hand and the request-response loop to the worker cloud on the other. FIG. 7 illustrates this view:

FIG. 7 emphasizes that the predictive modeling system 100 does not necessarily fit the layered model, which assumes that each layer mostly depends only on the layer immediately beneath it. Rather, the analytics services 630 and the data services 650 cooperate to coordinate the users and the computations. From this perspective, there are three "columns" of information flow:

(1) Interface <-> Analytics + Data. The leftmost column of flow 710 first transforms the user's raw data sets and modeling requirements into refined data sets and a list of computation tasks, then consolidates the results and delivers them to the user in a format the user can easily understand. Goals and constraints therefore flow from the interface services 620 to the analytics services 630, while progress and exceptions flow back. In parallel, raw data sets and user annotations flow from the interface services 620 to the data services 650, while trained models and their performance metrics flow back. At any point, the user can initiate changes and force adjustments by the analytics services 630 and data services 650 layers. Note that, in addition to this dynamic circular flow, there are also more traditional linear interactions (for example, when the interface services 620 retrieve system status from the analytics services 630 and static content from the data services 650).

(2) Analytics + Data <-> Workers. The rightmost column of flow 730 provisions workers, assigns computation tasks, and provides the data for those tasks. Task assignments, their parameters, and data references therefore flow from the analytics services 630 to the worker cloud 640, while progress and exceptions flow back. Refined data sets flow from the data services 650 to the worker cloud 640, while modeling results flow back. Updated instructions from the user can force the analytics services layer 630 to interrupt in-flight workers and assign updated modeling tasks, and can also force a refresh of the data set from the data services layer 650. In turn, the updated assignments and data sets change the flow of results coming back from the workers.

(3) Analytics <-> Data. The two layers in the middle coordinate between themselves to mediate between the left and right columns. Most of this traffic 720 concerns tracking the execution progress and intermediate calculations of the cloud workers. The flow can become complicated, however, particularly when responding to the in-flight changes to model-building instructions described above: the analytics and data services assess the current state of computation, determine which intermediate calculations are still valid, and correctly construct new computation tasks. Of course, more traditional linear interactions exist here as well (for example, when the analytics services retrieve rules and configurations for the cloud workers from the data services).
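By way of a non-limiting illustration, the step of determining which intermediate calculations remain valid after an in-flight change can be sketched as follows. The sketch assumes that each task is identified by its modeling technique, data partition, and parameters, and that completed results are stored under that identity; the names `task_key` and `plan_after_change` are hypothetical and do not appear in the described system.

```python
from typing import Dict, FrozenSet, List, Tuple

# Hypothetical task identity: (technique, data partition, frozen parameters).
TaskKey = Tuple[str, str, FrozenSet[Tuple[str, object]]]


def task_key(technique: str, partition: str, params: dict) -> TaskKey:
    """Build a hashable key so that identical tasks can be recognized."""
    return (technique, partition, frozenset(params.items()))


def plan_after_change(
    completed: Dict[TaskKey, str],
    requested: List[TaskKey],
) -> Tuple[Dict[TaskKey, str], List[TaskKey]]:
    """Split the newly requested tasks into reusable results and new work.

    `completed` maps a task key to a reference to its stored result.
    """
    still_valid = {k: completed[k] for k in requested if k in completed}
    to_compute = [k for k in requested if k not in completed]
    return still_valid, to_compute


done = {task_key("ridge", "fold-1", {"alpha": 1.0}): "results/ridge-f1-a1"}
wanted = [
    task_key("ridge", "fold-1", {"alpha": 1.0}),  # unchanged, can be reused
    task_key("ridge", "fold-1", {"alpha": 0.1}),  # parameter changed in flight
]
valid, todo = plan_after_change(done, wanted)
```

Under these assumptions, only the task whose parameters changed is scheduled again, while the unchanged task is served from the stored reference.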

This conceptual model of information flow provides context for the arrangement of functional modules within the layers. The modules are not simply stateless blocks that provide application programming interfaces (APIs) to higher-level blocks and consume APIs from lower-level blocks. Rather, they are dynamic participants in the collaboration between users and computation. FIG. 8 shows the arrangement of these functional modules. From the user's perspective, the interface services layer provides several distinct areas of functionality.

(1) User/project administration (802). Each machine-learning project has at least one assigned administrator, who can use the project management component of the interface to manage project-level parameters, responsibilities, and resources. This functional component also supports system-level administration functions.

(2) Monitoring (810). This module provides diagnostics on the computing infrastructure. It cooperates with a corresponding module 818 in the analytics services layer to track computing resource usage, both in real time for each worker instance and in total for each computation task.

(3) Technique designer (804). This module supports a graphical interface for using the methodology, technique, and task builders described above. One example of how this graphical interface could be implemented is JavaScript that runs in the client 610 and communicates with the technique designer 804 via AJAX requests, rendering the graph graphically for the user and pushing changes back to the server.

(4) Technique IDE (812). As described above, some implementations of the predictive modeling system 100 may provide technique developers with built-in access to an IDE for implementing leaf-level tasks. Such IDEs may support general-purpose programming languages used for machine learning, such as Python, or specialized scientific computing environments, such as R. This functionality may execute across the client 610, interface services 620, and analytics services 630 layers. The client component 610 may download and execute a JavaScript container for the IDE environment, which first registers a session with the interface services component via AJAX. After authenticating and validating the registration request, the interface services component downloads the user's project data to the client 610 and hands the session off to a dedicated IDE server instance running in the analytics services layer. This server instance then communicates directly with the client 610 via a WebSocket.

(5) Data tools (806). This module enables a model builder to specify a data set, understand it, and prepare it for model building.

(6) Modeling dashboard (814). Each project has its own modeling dashboard. An instance of this module provides the model builder with the controls and gauges needed to launch the modeling process for the project, measure results as they arrive, and make in-flight adjustments. The module calculates which modeling techniques to run against which data sets and passes these requirements to the analytics services layer. Once the execution engine begins building models, this module provides execution status and controls.

(7) Insights (808). Once the machine-learning process has begun generating substantial results, this module offers the model builder deeper insights. Examples include text mining summaries, predictor importance, and the one-way relationships between each predictor and the target. Most of these insights are easy to understand and do not require a deep knowledge of statistics.

(8) Prediction (816). Once the execution engine has built at least one model, this module provides the interface for making predictions based on new data.

Activity in the interface services layer triggers activity in the analytics services layer. As described above, the technique IDE and monitoring modules are partitioned so that they execute partly in the analytics services layer (see monitoring module 818 and technique IDE module 820). The other modules in this layer include the following:

(1) Job queue (822). Each project may have its own job queue instance, which services model computation requests from the corresponding modeling dashboard instance. A job includes a reference to a partition of the project's data set, a modeling technique, and a priority within the project. This module then constructs and maintains the prioritized list of jobs. When computation resources are available, a broker 824 requests the next job from the job queue. Users with sufficient permissions can add, remove, or reprioritize modeling jobs in the queue at any time. The queue is persisted via the ephemeral DB module 826, whose backend storage provides extremely fast response times.

(2) Brokers (824). These modules instantiate workers, assign jobs to them, and monitor their health. One broker may run for each worker cloud. Brokers dynamically provision and terminate worker instances to service the current level of demand from the open job queues, plus a safety buffer. Upon launch, each worker automatically registers with the broker for its cloud environment and provides information about its computational capabilities. Brokers and workers send each other heartbeat messages every few seconds. A worker automatically restarts and re-registers if it crashes or loses contact with its broker. If too many heartbeat messages from a worker are missed, the broker discards that worker from its pool of available resources and logs a warning. As new jobs arrive from the job queues and workers complete existing jobs, the brokers continually recalculate the number of workers and the allocation of jobs to those workers.

(3) Worker cloud (640). This module includes a pool of workers. Each worker is a running virtual machine instance, or another unit of self-contained computing resources, within its cloud environment, and it receives jobs from the corresponding broker. From the worker's perspective, a job includes references to the project, the partition of the project's data set, and the modeling technique. For each task in the assigned modeling technique, the worker may first check whether any other worker has already completed that task for that data set partition of the project by querying the file storage module 830, which has a special directory subtree for modeling results. If it is the first worker to process the task, it performs the calculation and saves the result to file storage 830 so that other workers can reuse it. Because modeling techniques are assembled from the tasks in a common modeling task library, there may be a substantial level of commonality in task execution across modeling techniques. Caching the results of task execution may allow an implementation to significantly reduce the amount of computing resources consumed.
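By way of a non-limiting illustration, the heartbeat bookkeeping of a broker and the result reuse of a worker, as described above, can be sketched as follows. The class `Broker`, the function `run_task`, the five-second interval, and the tolerance of three missed heartbeats are all assumptions made for this sketch; a plain dictionary stands in for the file storage module.

```python
import time
from typing import Callable, Dict, List, Optional


class Broker:
    """Tracks registered workers and drops those that stop sending heartbeats."""

    def __init__(self, interval_s: float = 5.0, max_missed: int = 3) -> None:
        self.interval_s = interval_s  # assumed heartbeat period
        self.max_missed = max_missed  # assumed tolerance before discarding
        self.last_seen: Dict[str, float] = {}
        self.warnings: List[str] = []

    def register(self, worker_id: str, now: Optional[float] = None) -> None:
        """Record the worker as alive at time `now`."""
        self.last_seen[worker_id] = time.monotonic() if now is None else now

    # A heartbeat simply refreshes the worker's last-seen timestamp.
    heartbeat = register

    def reap(self, now: Optional[float] = None) -> List[str]:
        """Discard workers silent for longer than `max_missed` intervals."""
        now = time.monotonic() if now is None else now
        limit = self.interval_s * self.max_missed
        dead = [w for w, seen in self.last_seen.items() if now - seen > limit]
        for w in dead:
            del self.last_seen[w]
            self.warnings.append(f"worker {w} discarded: heartbeats missed")
        return dead


def run_task(key: str, compute: Callable[[], object], store: Dict[str, object]):
    """Reuse a stored result if another worker has already produced it."""
    if key not in store:
        store[key] = compute()
    return store[key]
```

In this sketch the timestamps can be passed explicitly, which makes the discard rule easy to exercise without waiting for real time to elapse.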

The data services layer 650 provides a variety of different storage mechanisms to support the modules in the other layers.

(1) Ephemeral DB (826). This module provides an interface to, and manages, storage mechanisms for data that benefits from extremely fast access and/or is transient. Some implementations use an in-memory DBMS deployed in a master-slave configuration with automatic failover. This module provides an interface for storing objects as key-value pairs. Keys are linked to specific users and projects but are still very small. Values can be strings, lists, or sets.

(2) Persistent DB (828). This module provides an interface to, and manages, storage mechanisms for data that is persistent. In some implementations, the primary type of data handled by this module may include JSON objects, and the module may use a highly scalable non-SQL database deployed in a cluster with automatic failover, for both high availability and high performance. Objects stored via this module typically range up to a few megabytes in size.

(3) File storage (830). This module provides an interface to, and manages, storage mechanisms for files. The types of data stored via this module include uploaded data sets, derived data, model computations, and predictions. This module may layer a file directory and naming convention on top of cloud storage. In addition, when cloud workers access this module, they may also temporarily cache the stored files in their local storage.

(4) VM image storage (832). This module provides an interface to, and manages, storage for the VM images used to run IDE and worker instances. It stores images in a self-sufficient VM container format. For IDE instances, it preserves the user's state across sessions, whereas it loads new worker instances as blank copies from the template for that worker type.

Together, these services manage a wide variety of information, including the following:

(1) UI sessions (834): Data used to render the current state of an active user's session and to perform simple request authentication and access control.

(2) UI objects (836): Content displayed by the UI.

(3) Cache (838): Cached application content.

(4) System configuration (840): Configuration parameters for launching the computing infrastructure and running the model search services.

(5) System status (842): Real-time data collected from the modules of the system 600.

(6) User/project administration (844): Each project's settings and user privileges, as well as individual user settings.

(7) Data sets (846): Data files uploaded by users for a project.

(8) Modeling calculations (848): Intermediate modeling results, final fitted models, and calculated predictions.

(9) VM images (850): Images used to launch new IDE servers.

Again, the specific modules 802-850 described above are logical constructs. Each module may include executable code from many different source files, and a given source file may provide functionality to many different modules.

Out-of-Time Prediction

In some embodiments, the predictive modeling system 100 includes time-series models that can predict the values of a target X at time t and, optionally, at times t+1, ..., t+i, given observations of X at times before t and, optionally, observations of other predictor variables P at times before t. In some embodiments, the predictive modeling system 100 partitions past observations to train a supervised learning model, measure its performance, and improve its accuracy. In some embodiments, the time-series models provide useful time-related predictive features, for example, previous values of the target at different lags. In some embodiments, as time progresses and new observations arrive, the predictive modeling system 100 refreshes the time-series models, taking into account the amount of new information in those observations and the cost of refitting the models.
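By way of a non-limiting illustration, deriving lagged values of the target as predictive features and partitioning past observations in time order can be sketched as follows. The function names, the choice of lags 1 and 2, and the two-row holdout are assumptions made for this sketch and are not specified in the description.

```python
from typing import List, Optional, Sequence, Tuple

Row = Tuple[List[Optional[float]], float]


def lag_features(series: Sequence[float], lags: Sequence[int]) -> List[Row]:
    """Pair each observation x[t] with its earlier values x[t - lag]."""
    rows = []
    for t, target in enumerate(series):
        feats = [series[t - lag] if t - lag >= 0 else None for lag in lags]
        rows.append((feats, target))
    return rows


def time_ordered_split(rows: List[Row], holdout: int) -> Tuple[List[Row], List[Row]]:
    """Train on the earliest rows and validate on the most recent ones.

    `holdout` must be a positive number of rows.
    """
    return rows[:-holdout], rows[-holdout:]


sales = [10.0, 12.0, 11.0, 13.0, 15.0, 14.0]
rows = lag_features(sales, lags=[1, 2])
# Drop rows whose lagged values would fall before the start of the series.
rows = [r for r in rows if None not in r[0]]
train, valid = time_ordered_split(rows, holdout=2)
```

Keeping the validation rows strictly later than the training rows mirrors the prediction setting, in which only observations before time t are available.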

An example illustrating a beneficial use of time-series models is described below. In this example, a supermarket chain wishes to predict daily sales for the next six weeks at each of its locations. The available data includes three years of prior daily sales from 10,000 locations, along with other variables (for example, population and economic growth around each location, historical and planned dates of holidays and major social events, and planned dates of chain promotions). In some embodiments, a time-series model trained on the available data can accurately predict the daily sales for the next six weeks at each of the supermarket chain's locations.

Some embodiments of techniques for generating and using time-series models are described below. When a data scientist develops a predictive modeling technique with the modeling technique builder 220, the data scientist may indicate that the modeling technique is specific to time-series prediction problems. The modeling technique builder 220 then encodes this characteristic in the modeling technique's metadata. The data set itself may also have time-series-specific metadata (for example, the date range over which the data was generated, the temporal resolution of the observations, any down-sampling that has already occurred, and so on).

When the exploration engine 110 loads a data set (for example, at step 404 of the method 400 shown in FIG. 4), it may automatically detect whether the data set appears to contain time-series data and, if so, what time index it appears to have. A time index may include a time resolution and a time step ("time interval"). The time resolution is the unit in which time is kept (for example, seconds or days). For example, if the dates (for example, all of the dates) are encoded in a standard date format (for example, mm/dd/yyyy), the engine 110 may use days as the time resolution. As another example, if the dates (for example, all of the dates) include hours, minutes, and seconds, the engine 110 may use seconds as the time resolution. Similarly, the engine 110 may use any suitable time metric (for example, millennia, centuries, decades, years, quarters, seasons, months, weeks, days, hours, minutes, seconds, milliseconds, microseconds, nanoseconds, and so on) as the common time resolution. The time step is the time period (for example, the minimum time period, the most common time period, a user-specified time period, and so on) between consecutive observations (for example, daily, weekly, or yearly data).
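By way of a non-limiting illustration, detecting a time resolution and a time step from a column of dates can be sketched as follows. The sketch assumes the mm/dd/yyyy format mentioned above and uses the most common gap between consecutive observations as the time step; the function names are hypothetical, and only day and second resolutions are distinguished.

```python
from collections import Counter
from datetime import datetime, timedelta
from typing import Iterable, List


def parse_dates(values: Iterable[str], fmt: str = "%m/%d/%Y") -> List[datetime]:
    """Parse date strings and return them in chronological order."""
    return sorted(datetime.strptime(v, fmt) for v in values)


def most_common_step(timestamps: List[datetime]) -> timedelta:
    """Return the most frequent gap between consecutive observations."""
    gaps = Counter(b - a for a, b in zip(timestamps, timestamps[1:]))
    return gaps.most_common(1)[0][0]


def resolution_of(timestamps: List[datetime]) -> str:
    """Use seconds if any timestamp carries a time of day, otherwise days."""
    has_time = any((t.hour, t.minute, t.second) != (0, 0, 0) for t in timestamps)
    return "second" if has_time else "day"


dates = parse_dates(["01/15/2016", "01/01/2016", "01/08/2016", "01/29/2016"])
step = most_common_step(dates)    # weekly data with one missing week
resolution = resolution_of(dates)
```

A minimum or user-specified time period could be substituted for the most common gap without changing the rest of the sketch.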

If the data set contains time-series data with a mix of time resolutions, the engine 110 may use the most common resolution as part of the time index, or it may use the "lowest common resolution" after converting all of the time data to a common resolution. In some embodiments, the engine 110 uses an internal objective function that weights the frequencies and inconsistencies of the potential indexes in order to select an index (for example, the optimal index). For example, if 90% of the date variables have a resolution of days and 10% have a resolution of seconds, the objective function may determine that a resolution of days is the best choice. The inverse mix (90% with a resolution of seconds and 10% with a resolution of days) may yield a determination that a resolution of seconds is the best choice. For a 50/50 mix, the objective function may determine that a compromise on a resolution between days and seconds (for example, hours) is optimal. The choice of objective function may be further determined by meta-machine learning over the space of previously used objective functions and their accuracy on prediction problems with various characteristics.
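By way of a non-limiting illustration, one possible objective function of this kind can be sketched as follows. The description does not specify the function; this sketch assumes a share-weighted squared distance between a candidate resolution and each observed resolution, a short list of four candidates, and a tie-break in favor of the coarser resolution. All names are hypothetical.

```python
from typing import Dict

# Candidate resolutions ordered from coarsest to finest (an assumed, short list).
CANDIDATES = ["day", "hour", "minute", "second"]
RANK = {name: i for i, name in enumerate(CANDIDATES)}


def mismatch_cost(candidate: str, shares: Dict[str, float]) -> float:
    """Share-weighted squared distance from a candidate to the observed resolutions."""
    return sum(
        share * (RANK[candidate] - RANK[observed]) ** 2
        for observed, share in shares.items()
    )


def choose_resolution(shares: Dict[str, float]) -> str:
    """Pick the lowest-cost candidate; ties go to the coarser resolution."""
    return min(CANDIDATES, key=lambda c: mismatch_cost(c, shares))


mostly_days = choose_resolution({"day": 0.9, "second": 0.1})
mostly_seconds = choose_resolution({"day": 0.1, "second": 0.9})
even_mix = choose_resolution({"day": 0.5, "second": 0.5})
```

With these assumptions the three mixes discussed above resolve to days, seconds, and hours respectively; a different cost or tie-break rule would be an equally valid choice.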

In some embodiments, the user may override the exploration engine's determination of whether a data set is a time-series data set (for example, at step 406 of the method 400), either blocking or forcing treatment of the data set as a time-series data set. If the user chooses to force treatment of the data set as a time-series data set, the user may specify which variables are time-based, how to interpret the data format, and/or the time index. In some cases, the user may provide information on how to transform the variables to use a given time index.

For an automatically detected time-series data set, the user may accept the suggested time index, select from any automatically detected alternatives, or specify a custom time index, which may also require specifying how to transform variables to use the custom time index. Once the time index is accepted, selected, or specified, other time variables may be converted to offsets from one another and used as predictive features. For example, if the data set uses mm/dd/yyyy as the time index and one of the variables is "birthday," the engine 110 may compute the offset from "birthday" (also known as "age") for other events (e.g., "marriage," "child's birthday," etc.).

After the user confirms or indicates that the data set is a time-series data set and confirms, selects, or specifies the time index, the engine 110 evaluates the data set (e.g., at step 408 of method 400) and presents this evaluation to the user (e.g., at step 410 of method 400). The presentation may include plots showing target and predictor variable values over time, individually or in groups. In some embodiments, the presentation shows the dependence of the current value of each time-series variable on its past values at one or more time lags. If the data set is a panel data set containing a mixture of time-series and cross-sectional variables, the presentation may show dependencies between cross-sectional variables in the same time period or at different lags.

The user may indicate a "skip range" in the data, which is the gap between the end of the training window (e.g., the time range of the data used for training) and the start of the validation window (e.g., the time range of the data used for validation) or the holdout window (e.g., the time range of the data used for holdout testing). In some cases, there are operational or logistical reasons for a delay between the last historical observation and the first forecast. In the supermarket example, the data for the previous week may all arrive at headquarters on Sunday, perfectly clean. However, even after the predictive model completes, it may take several days operationally to deliver forecast information to all store locations and for the stores to adjust their operations in response to the forecasts. Likewise, there may be a delay in receiving or cleaning the historical data. In either case, the practical objective may be to begin forecasting after a particular gap from the last available historical observation. The skip range allows the engine 110 to evaluate candidate predictive models with this gap in place.

The user may indicate a desired forecast range (e.g., the number of future time periods to be predicted by the model or the number of distinct future events to be predicted by the model). In some cases, this range may be just one observation, for example, next quarter's sales. In others, as in the supermarket chain example, the range may include several future time periods, in that example 42 days.

In some cases, the engine 110 may suggest a forecast range based on the frequency of the data and/or the total number of time periods in the data. For example, with daily data and a relatively small number of time periods, the engine 110 may suggest a 7-day forecast, whereas with a relatively large number of time periods, it may suggest a 30-day forecast. For monthly data, the engine 110 may suggest a 3-month, 6-month, or 12-month forecast, depending on the number of time periods. With quarterly data, the engine 110 may suggest a 4-quarter, 8-quarter, or 12-quarter forecast. For annual data, the engine 110 may suggest a 5-year, 10-year, or 20-year forecast.
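
As a rough illustration of this kind of suggestion, the sketch below picks the longest of the listed forecast ranges that the available history can support. The candidate ranges come from the paragraph above; the selection rule and the `history_multiple` threshold are assumptions made for the example, as are the function and variable names.

```python
# Forecast ranges listed in the text, keyed by data frequency (in periods).
SUGGESTED_RANGES = {
    "daily": [7, 30],
    "monthly": [3, 6, 12],
    "quarterly": [4, 8, 12],
    "annual": [5, 10, 20],
}


def suggest_forecast_range(frequency, n_periods, history_multiple=10):
    """Suggest the longest listed range backed by enough history.

    history_multiple is a hypothetical rule of thumb: a range of h periods
    is suggested only if at least history_multiple * h periods are available.
    """
    options = SUGGESTED_RANGES[frequency]
    supported = [h for h in options if n_periods >= history_multiple * h]
    # Fall back to the shortest listed range when history is scarce.
    return max(supported) if supported else min(options)


short_daily = suggest_forecast_range("daily", 60)
long_daily = suggest_forecast_range("daily", 1000)
```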

In some cases, the engine 110 uses a validation range equal to the span of observations in the forecast range, or to a multiple (e.g., a logical multiple) of the forecast range (e.g., if the user specifies an unusually short forecast range).

For time-series data, the engine 110 may implement cross-validation and holdout (e.g., at steps 418 and 422 of method 400) with a set of training ranges, a corresponding set of validation ranges offset by the skip range, and a holdout range. The engine 110 may have a default target number of training and validation ranges, which it may adjust according to the amount of data available. Data sets with relatively few time periods may be split into fewer training and validation ranges, and data sets with relatively many time periods may be split into more training and validation ranges.

Depending on the size of the data, the frequency of the data, the length of the skip range, and/or the forecast range, the engine 110 may select a length for the training ranges. For example, with daily observations and skip and forecast ranges expressed in weeks, months, or years (or, correspondingly, in multiples of days such as 7, 30, and 365), the engine may choose training ranges that span a whole number of weeks, months, or years. The length of these training windows may depend on the total number of observations, the total amount of data, the total amount of variation in the target and predictor variables over time, the amount of seasonal variation exhibited by the variables, the consistency of the variables' variability across different time windows, and/or the target length of the forecast period. The engine 110 may use an internal objective function that weighs these factors to select a length (e.g., an optimal length) for the training windows. For example, the default may be five training and validation ranges split evenly over the entire data set. However, if the amount of variation over longer time periods is low, the engine 110 may shorten the time windows. Or, if the data exhibit annual seasonality, the engine 110 may use only three windows so that each range contains several years of data. Or, if the data set contains several particular periods of high variation, the engine 110 may split the data so that each window includes one of these periods. The choice of objective function may be determined by meta-machine learning over the space of previously used objective functions and their accuracy on prediction problems with various characteristics.

Given the desired number of training and validation ranges and the length of those ranges plus the skip range, the engine 110 may split the data set into a consistent series of training and validation ranges. For panel data, each training and validation pair may be further split into folds (e.g., by randomly assigning cross-sectional observations to folds). In the supermarket example, the training range may be 30 weeks, the skip range 1 week, and the validation range 6 weeks, which yields approximately four training sets over three years. However, because there are 10,000 stores, the stores may be further "down-sampled," as described below, to improve performance. In addition, the engine may reserve sub-windows within the training and validation windows for use in tuning model hyper-parameters and blended models.
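
The arithmetic of the supermarket example can be sketched as follows. The sketch assumes that consecutive, non-overlapping blocks (training range, then skip range, then validation range) are tiled backward from the most recent period, which is one possible layout and not necessarily the one the engine uses; `split_ranges` is a hypothetical helper, and ranges are half-open period indices.

```python
def split_ranges(n_periods, train_len, skip_len, valid_len):
    """Tile the data with (training, validation) index ranges.

    Each block is train_len + skip_len + valid_len periods long. Blocks are
    aligned to the end of the data, so any leftover periods are the oldest.
    Returns a list of ((train_start, train_end), (valid_start, valid_end)).
    """
    block = train_len + skip_len + valid_len
    n_blocks = n_periods // block
    start = n_periods - n_blocks * block  # skip the oldest leftover periods
    splits = []
    for _ in range(n_blocks):
        train = (start, start + train_len)
        valid = (start + train_len + skip_len, start + block)
        splits.append((train, valid))
        start += block
    return splits


# Three years of weekly data: 30-week training, 1-week skip, 6-week validation.
splits = split_ranges(n_periods=156, train_len=30, skip_len=1, valid_len=6)
for train, valid in splits:
    print("train", train, "validate", valid)
```

Each block covers 37 weeks, and 156 weeks hold four such blocks, which matches the "approximately four training sets" in the example.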

In some embodiments, the holdout data are only in the final time window. However, if the data set is panel data and has been down-sampled, the holdout data may come from the same time period as the other data but from a different, non-overlapping sample.

As with cross-sectional models (see, e.g., the description of step 350 of method 300), the engine 110 may iterate through the data set, training each model on a small portion of the training window, evaluating its performance on that portion, and deciding based on that performance whether to continue testing the model on additional data. For time-series data, each portion may end at the last observation of the training window; the initial portion may begin such that the partial training window is a logical multiple of the validation window, and larger portions may use larger multiples. For example, a validation window measured in weeks may use a first portion beginning 4 weeks before the end of the training window, a second portion beginning 8 weeks before, and a third portion beginning 12 weeks before. A validation window measured in months may use a first portion beginning 3 months before the end of the training window, a second portion beginning 6 months before, a third portion beginning 9 months before, a fourth portion beginning 12 months before, and so on. A validation window measured in years may use a first portion beginning 4 years before the end of the training window, a second portion beginning 8 years before, and a third portion beginning 12 years before; alternatively, the portions may begin 5, 10, and 15 years before the end of the training window. The portions may grow linearly (e.g., 3, 6, 9, 12 periods or 4, 8, 12, 16 periods) or geometrically (e.g., 3, 6, 12, 24 periods or 4, 8, 16, 32 periods). Unusual schedules based on the problem domain and/or analysis of the data may even grow the portions exponentially (e.g., 3, 6, 24, 192 periods or 4, 8, 32, 256 periods).
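
The portion schedules listed above can be reproduced with a short sketch. The function names are hypothetical, and the formula used for the "exponential" case (successive multipliers of 2, 4, 8, and so on) is simply one rule that reproduces the listed numbers, not a formula stated in the text.

```python
def window_schedule(base, steps, growth="linear"):
    """Return partial-training-window lengths, in periods."""
    if growth == "linear":
        return [base * (i + 1) for i in range(steps)]
    if growth == "geometric":
        return [base * 2 ** i for i in range(steps)]
    if growth == "exponential":
        # Multipliers 2, 4, 8, ... applied cumulatively: 2 ** (i * (i + 1) / 2).
        return [base * 2 ** (i * (i + 1) // 2) for i in range(steps)]
    raise ValueError(f"unknown growth: {growth}")


def partial_windows(train_end, lengths):
    """Every portion ends at the last training observation (half-open ranges)."""
    return [(max(0, train_end - length), train_end) for length in lengths]


weekly = partial_windows(train_end=52, lengths=window_schedule(4, 3))
print(weekly)
```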

To improve execution speed and reduce resource consumption, the engine 110 may suggest default down-sampling. There are two types of down-sampling: chronological and cross-sectional. In chronological down-sampling, the total number of observations within each split may be reduced by a fixed percentage, or the observations may be aggregated to a longer time resolution. For example, a time window containing 10 billion observations may be reduced to 10 million observations. Or, a window of data with an observation every millisecond may instead have observations aggregated every minute. In this case, all observations within a down-sampling window may be collapsed to a single value for the aggregated observation. This aggregate value may be, for example, the starting value, the ending value, the most extreme value, the mean value, or a value computed through a weighting function to account for some characteristic of the variation. The engine 110 may use an objective function that weighs the potential reduction in observations against the variability of the target and predictors within the corresponding time window. For example, a 1000x reduction in observations may be justified at a cost of 5% of the variability. The choice of objective function may be further determined by meta-machine learning over the space of previously used objective functions and their accuracy on prediction problems with various characteristics.

Chronological down-sampling may be used, for example, for extremely high-frequency data in which the variation in the data is low relative to the frequency of the data. Cross-sectional down-sampling is generally more common, because there may be a very large number of sections whose behavior does not vary much over time. In the supermarket chain example, there are 10,000 sections, one for each store location. Most of the variation may be due to chronological factors, with relatively low cross-sectional variation across stores. For example, similar stores may actually fall into a much smaller number of representative groups. Accordingly, the engine 110 may simply construct random subsamples of the sections, in this case store locations, and train predictive models on one or more of the subsamples. For example, instead of using all 10,000 stores, the engine 110 may randomly select 1,000 stores for training. The engine 110 may also apply such down-sampling when cross-validation is used along the cross-sectional dimension. For example, for 5-fold cross-validation, the engine may use 5 folds of 400 stores instead of 5 folds of 2,000 stores. The percentage of down-sampling may be determined by a function of the number of sections, determined by an objective function that balances multiple characteristics of the sample, or specified by the user. In some cases, the selection of the down-sampling percentage may be refined by meta-machine learning across many different prediction problems.
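
Cross-sectional down-sampling of this kind can be sketched with the standard library alone. The store identifiers, the fixed seed, and the helper name `downsample_sections` are assumptions for illustration; the sample size of 2,000 is chosen so that the result matches the "5 folds of 400 stores" example.

```python
import random


def downsample_sections(section_ids, sample_size, n_folds, seed=0):
    """Randomly subsample sections, then deal the sample into disjoint folds."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    sample = rng.sample(list(section_ids), sample_size)
    # Every n_folds-th sampled section goes to the same fold.
    return [sample[i::n_folds] for i in range(n_folds)]


store_ids = range(10_000)  # hypothetical store identifiers
folds = downsample_sections(store_ids, sample_size=2_000, n_folds=5)
print([len(fold) for fold in folds])
```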

One challenge in generating accurate forecasts from time-series models is that the forecasts may be sensitive to the choice of training and validation time windows. In some embodiments, the engine automatically evaluates this sensitivity. For example, the engine 110 may evaluate the sensitivity to time window selection for every modeling technique as it is run. As another example, the engine 110 may evaluate this sensitivity after a model exceeds a certain threshold of predictive accuracy. A third option is to evaluate the sensitivity of the top models based on their relative predictive accuracy. A fourth option is to run the sensitivity analysis on demand when the user requests it. The sensitivity analysis may include fitting the model on sliding training and validation windows and then measuring how the model's accuracy varies with the points included in the windows. For example, a graph may plot model accuracy on the vertical axis and the starting observation of the training window on the horizontal axis.
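
A toy version of such a sensitivity analysis is sketched below. It assumes a deliberately simple stand-in model (forecasting the mean of the training window), a fixed validation window, no skip range, and mean absolute error as the accuracy measure; all of these are simplifications chosen for the example.

```python
def sensitivity_to_window_start(series, valid_start, starts):
    """Measure validation error as the training window's start is moved.

    The stand-in "model" forecasts the mean of the training window.
    Returns {training_start_index: mean absolute error on validation}.
    """
    valid = series[valid_start:]
    results = {}
    for start in starts:
        train = series[start:valid_start]
        forecast = sum(train) / len(train)
        results[start] = sum(abs(v - forecast) for v in valid) / len(valid)
    return results


# A series whose level shifts from 10 to 20 partway through.
series = [10.0] * 20 + [20.0] * 20
errors = sensitivity_to_window_start(series, valid_start=30, starts=[0, 10, 20])
for start, mae in errors.items():
    print(start, round(mae, 2))
```

Plotting the resulting errors against the training start index gives the kind of graph described above; in this toy series, windows that begin after the level shift give the lowest error.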

Once the engine has finished fitting the time-series models and presented them to the user (e.g., at step 446 of method 400), the user may further explore the effect of different training data windows on model performance. The user interface may allow the user to select a particular fitted model or group of fitted models and to adjust the training and validation windows to include any observations that are not in the holdout window. If the engine did not use sub-windows for hyper-parameter tuning, the optimal hyper-parameters computed during the original fitting of the model may be used when refitting the model on the new windows. If nested training and validation windows are available, the engine may automatically recompute the optimal hyper-parameters on the reserved windows and/or allow the user to choose whether to recompute the optimal hyper-parameters or to use the originally computed ones.

After the user has reviewed the holdout results (e.g., at step 450 of method 400), the user may refit any subset of the models using any combination of data from the training, validation, and holdout windows. Because the data from the holdout window are typically the most recent data, including them in the fitting of a model may improve the model's accuracy in predicting future values. Accordingly, before a model is deployed to make new predictions, the model may be refit on training data whose last observation is the last observation of the holdout window. The selection of the first observation to use for this refitting may depend on the size (e.g., the optimal size) of the training window computed during the sensitivity analysis. The user may override the computed starting point based on the user's analysis of the sensitivity and/or domain knowledge of the prediction problem.

Some embodiments of time-series modeling techniques

Referring to FIG. 9, a method 900 for time-series predictive modeling may include steps 910-980. At step 910, time-series data are obtained. The time-series data may include one or more data sets. Each data set may include a plurality of observations. Each observation may include (1) an indication of a time associated with the observation and (2) values of one or more variables. At step 920, a time interval of the time-series data is determined. At step 930, one or more variables of the time-series data are identified as targets. Optionally, one or more variables of the time-series data may also be identified as features. At step 940, a "forecast range" and a "skip range" associated with the prediction problem represented by the time-series data are determined. The forecast range may indicate the duration of the time period for which values of the targets are to be predicted. The skip range may indicate the time lag between the time associated with the earliest prediction in the forecast range and the time associated with the most recent observation on which the predictions in the forecast range are based.

At step 950, training data are generated from the time-series data. The training data include a first subset of the observations of at least one data set. The first subset of observations includes a training-input collection and a training-output collection of the observations. The times associated with the observations in the training-input and training-output collections correspond to a training-input time range and a training-output time range, respectively. The skip range separates the end of the training-input time range from the start of the training-output time range. The duration of the training-output time range is at least as long as the forecast range. At step 960, testing data are generated from the time-series data. The testing data include a second subset of the observations of at least one data set. The second subset of observations includes a test-input collection and a test-validation collection of the observations. The times associated with the observations in the test-input and test-validation collections correspond to a test-input time range and a test-validation time range, respectively. The skip range separates the end of the test-input time range from the start of the test-validation time range. The duration of the test-validation time range is at least as long as the forecast range. At step 970, a predictive model is fit to the training data. At step 980, the fitted model is tested on the testing data.

At step 910, time-series data are obtained. The time-series data may be obtained from any suitable source using any suitable technique (e.g., measured using sensors, received via a communication network, loaded from a computer-readable medium, etc.). The time-series data may include one or more data sets, each of which may include one or more observations. A data set may correspond to a respective "section" of the data, as described above. Each observation in a data set may include an indication (e.g., a timestamp) of a time associated with that observation. The time associated with an observation may be the time when the values of the observation were measured, reported, received, etc. The "time" associated with an observation may include a date and/or a time of day, or any other suitable temporal data.

At step 920, the time interval of the time-series data is determined. In some embodiments, the time interval of the data is explicitly indicated by metadata associated with the data. In some embodiments, the time interval of the data is determined by analyzing the data (e.g., based on the time resolution of the time indicators associated with the observations and/or the intervals between the times associated with consecutive observations).

The time intervals between the observations in a data set may be uniform or non-uniform. If the time intervals between the observations of the data set are uniform, the uniform interval may be the time interval of the data set. If the time intervals between the observations of the data set are non-uniform, the data set may be modified such that the time intervals between the observations of the modified data set are uniform. The time interval for the modified data set may be determined, for example, by applying an objective function (e.g., a weighted objective function) to the original data set. The objective function may determine the modified time interval based on, for example, (1) the respective proportions of pairs of consecutive observations exhibiting each of the non-uniform time intervals, and/or (2) the durations of the non-uniform time intervals. In some embodiments, the modified time interval is the shortest common period (e.g., the shortest period that is an integer multiple of each of the non-uniform periods).
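
The two inputs named above, and the "shortest common period" option, can be sketched as follows. The sketch assumes sorted integer timestamps (e.g., in seconds); the shortest period that is an integer multiple of every observed interval is their least common multiple. The helper names are hypothetical.

```python
from collections import Counter
from functools import reduce
from math import gcd


def interval_shares(timestamps):
    """Return {interval: share of consecutive-observation pairs with it}."""
    gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
    counts = Counter(gaps)
    return {gap: n / len(gaps) for gap, n in counts.items()}


def shortest_common_period(intervals):
    """Least common multiple of the observed intervals."""
    return reduce(lambda a, b: a * b // gcd(a, b), intervals)


timestamps = [0, 60, 120, 210, 270]  # seconds; gaps of 60, 60, 90, 60
shares = interval_shares(timestamps)
period = shortest_common_period(list(shares))
print(shares, period)
```

A weighted objective function could combine the shares and durations differently; the least common multiple is only the specific embodiment mentioned in the last sentence of the paragraph.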

An original data set with non-uniform time intervals may be converted into a modified data set having the modified, uniform time interval by sampling (e.g., down-sampling) and/or aggregating the observations of the original data set. Down-sampling the observations of the original data set may include, for each instance of the modified time interval within the time period covered by the observations of the original data set: identifying all observations in the original data set associated with times corresponding to that instance of the time interval, aggregating the identified observations to generate an aggregate observation, and inserting the aggregate observation into the modified data set. Aggregating a set of identified observations may be performed by setting the value of each variable in the aggregate observation to (1) the value of the corresponding variable in the earliest of the identified observations, (2) the value of the corresponding variable in the most recent of the identified observations, (3) the maximum of the values of the corresponding variable in the identified observations, (4) the minimum of the values of the corresponding variable in the identified observations, (5) the mean of the values of the corresponding variable in the identified observations, or (6) the value of a function of the values of the corresponding variable in the identified observations.
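
The down-sampling procedure described above can be sketched for a single variable. The sketch assumes integer timestamps and represents each observation as a `(time, value)` pair; the names `AGGREGATORS` and `aggregate` are illustrative, and the five built-in options correspond to items (1) through (5), with any other callable standing in for item (6).

```python
from collections import defaultdict

# Aggregation options (1)-(5) from the text, applied to time-ordered values.
AGGREGATORS = {
    "first": lambda values: values[0],
    "last": lambda values: values[-1],
    "max": max,
    "min": min,
    "mean": lambda values: sum(values) / len(values),
}


def aggregate(observations, interval, how="mean"):
    """Collapse (time, value) observations into one value per interval.

    how is a key of AGGREGATORS or any callable taking a list of values.
    Intervals that contain no observations produce no output row.
    """
    func = AGGREGATORS.get(how, how)
    buckets = defaultdict(list)
    for time, value in sorted(observations, key=lambda obs: obs[0]):
        # Label each bucket by the start time of its interval instance.
        buckets[time // interval * interval].append(value)
    return [(start, func(values)) for start, values in sorted(buckets.items())]


observations = [(0, 1.0), (20, 3.0), (70, 5.0), (110, 7.0)]
per_minute = aggregate(observations, interval=60, how="mean")
print(per_minute)
```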

The time intervals of the data sets of the time-series data may be uniform or non-uniform across data sets. If the time intervals of the data sets are uniform, that uniform interval may be the time interval of the time-series data. If the time intervals of the data sets are non-uniform, one or more of the data sets may be modified such that the time intervals of the modified data sets are uniform. The modified time interval of the time-series data may be selected from the intervals of the data sets or determined using any other suitable technique. For example, the modified time interval of the time-series data may be computed using an objective function (i.e., a weighted objective function) based on (1) the respective proportions of the observations included in data sets exhibiting each of the non-uniform time intervals, and/or (2) the durations of the non-uniform time intervals of the data sets. In some embodiments, the modified time interval of the time-series data is the shortest common period (e.g., the shortest period that is an integer multiple of each of the non-uniform time intervals of the data sets). Some embodiments of techniques for modifying the time interval of a data set are described above.

In some embodiments of method 900, feature engineering may be performed on the time series data. Such feature engineering may be performed, for example, before or after the time interval of the time series data is determined. In some embodiments, performing feature engineering on the time series data includes: identifying a first variable in the time series data whose values represent times; generating values of a second variable, each value of the second variable being an offset between a time value of the first variable and a reference time value; and adding the second variable to the time series data (e.g., adding each value of the second variable to the observation that contains the value of the first variable from which it was derived). In some embodiments, the first variable may be removed from the time series data. In some embodiments, the reference time is the date of an event (e.g., birth, marriage, graduation from school, start of employment with an employer, start of work in a particular position, etc.). This feature engineering technique can be used to convert absolute time values into relative time values, which can greatly facilitate the identification of patterns (e.g., patterns related to the age of an entity) in data from different data sets covering different time periods, and can thereby greatly facilitate accurate prediction of values related to such patterns.
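The conversion from absolute to relative time described above might look like the following sketch. The field names (`visit_date`, `age_days`), the choice of days as the offset unit, and the birth date used as the reference time are all assumptions chosen for illustration.

```python
from datetime import date

def add_relative_time(observations, time_key, reference, new_key, drop_original=False):
    """Add a second variable holding the offset (in days) from a reference time."""
    out = []
    for obs in observations:
        row = dict(obs)                      # leave the input observation untouched
        row[new_key] = (obs[time_key] - reference).days
        if drop_original:
            del row[time_key]                # optionally remove the first variable
        out.append(row)
    return out

records = [
    {"visit_date": date(2017, 1, 10), "weight_kg": 4.1},
    {"visit_date": date(2017, 3, 1), "weight_kg": 5.6},
]
birth = date(2017, 1, 1)                     # hypothetical reference event
with_age = add_relative_time(records, "visit_date", birth, "age_days", drop_original=True)
```

After the conversion, entities observed in different calendar periods can be compared at the same age.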

In some embodiments, performing feature engineering includes temporally down-sampling the time series data. Several techniques for temporally down-sampling time series data are described above. Temporally down-sampling the time series data may improve the efficiency (e.g., reduce the computational resource usage) of time series modeling techniques that operate on the time series data.

In some embodiments of method 900, graphical information (e.g., a graph or chart) may be presented (e.g., displayed) via a user interface. The graphical information may indicate the time lag between a change in the value of one variable and a correlated change in the value of another variable. Such correlations may be detected using any suitable technique.
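The disclosure leaves the detection technique open. One possible approach, offered only as an assumption, is to search for the lag at which the Pearson correlation between two variables is highest; the function names and toy series below are hypothetical.

```python
def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0

def best_lag(leader, follower, max_lag):
    """Return the lag (in time steps) at which follower best tracks leader."""
    scores = {}
    for lag in range(max_lag + 1):
        xs = leader[: len(leader) - lag]     # leader values
        ys = follower[lag:]                  # follower values, lag steps later
        scores[lag] = pearson(xs, ys)
    return max(scores, key=scores.get)

price = [1, 3, 2, 5, 4, 6, 2, 7, 3, 8]
demand = [0, 0] + price[:-2]                 # toy series: demand echoes price two steps later
lag = best_lag(price, demand, max_lag=4)
```

The resulting lag could then be plotted or annotated in the graphical information presented to the user.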

In step 930, one or more variables of the time series data are identified as targets. The targets may be identified, for example, based on metadata associated with the time series data, a description of the prediction problem, and/or user input. Optionally, one or more variables of the time series data may also be identified as features. The features may be identified, for example, based on metadata associated with the time series data, a description of the prediction problem, and/or user input.

In step 940, an expected range associated with the prediction problem represented by the time series data is determined. The expected range may indicate the duration of the period for which values of the targets are to be predicted. The expected range may be determined based on (1) the time interval of the time series data, (2) the number of observations included in the time series data, (3) the time period covered by the observations in the time series data, (4) a natural time period selected from the group consisting of microseconds, milliseconds, seconds, minutes, hours, days, weeks, months, quarters, seasons, years, decades, centuries, and millennia, (5) user input, etc. In some embodiments, the expected range is an integer multiple of the time interval of the time series data. In general, the expected range may increase as the number of observations in the time series data increases.

In step 940, a skip range associated with the prediction problem represented by the time series data is determined. The skip range may indicate the time lag between the time associated with the earliest prediction in the expected range and the time associated with the most recent observation on which the predictions in the expected range are based. The skip range may be determined based, at least in part, on latency in the collection of the time series data, latency in the communication of the time series data, latency in the analysis of the time series data, latency in the communication of the analysis of the time series data, and/or latency in the implementation of actions based on the analysis of the time series data. Such latencies may be determined, for example, based on user input and/or metadata associated with the time series data. In some embodiments, the skip range is determined based on metadata associated with the time series data or is specified by a user.

In step 950, training data are generated from the time series data. The training data include a first subset of the observations of at least one of the data sets. The first subset of observations includes a training-input collection and a training-output collection of observations. The times associated with the observations in the training-input and training-output collections correspond to a training-input time range and a training-output time range, respectively. The skip range separates the end of the training-input time range from the beginning of the training-output time range. The duration of the training-output time range is at least as long as the expected range.
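The relationship among the training-input time range, the skip range, and the training-output time range can be sketched as below. The sketch assumes integer time stamps measured in units of the time interval and half-open ranges; those conventions, and the function name, are illustrative assumptions rather than part of the disclosure.

```python
def split_by_ranges(observations, input_start, input_len, skip, forecast):
    """Split time-indexed observations into input and output collections."""
    input_end = input_start + input_len      # exclusive end of the input range
    output_start = input_end + skip          # the skip range separates the two
    output_end = output_start + forecast     # output range spans the expected range
    inputs = [o for o in observations if input_start <= o["t"] < input_end]
    outputs = [o for o in observations if output_start <= o["t"] < output_end]
    return inputs, outputs

series = [{"t": t, "y": t * 10} for t in range(20)]
train_in, train_out = split_by_ranges(
    series, input_start=0, input_len=8, skip=2, forecast=3
)
```

Here the observations at times 8 and 9 fall inside the skip range and belong to neither collection. The same helper could be applied to a later window to produce test-input and test-validation collections.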

In some embodiments, the duration of the training-input time range is determined based on the total number of observations in the time series data, the amount of variation over time in the values of at least one of the variables, the amount of periodic variation in the values of at least one of the variables, the consistency of the variation in the values of at least one of the variables over multiple time periods, and/or the duration of the expected range.

In some embodiments, subsets of the training data are identified so that predictive models can be trained on the subsets rather than on all of the training data. Training a model on a subset of the training data may use fewer computational resources (and less time) than training the model on all of the training data. In some embodiments, each subset of the training data ends at the end time of the training-input time range. The training data subsets may begin at different times within the training-input time range and/or may sample the observations in the training-input time range at different sampling rates. The duration of a training data subset may be an integer multiple of the duration of the expected range.
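One way to enumerate such subsets is sketched below, assuming integer time stamps. The parameter names `multiples` and `strides` are hypothetical: the first controls how far back each subset starts (as a multiple of the expected range), and the second controls the sampling rate.

```python
def training_subsets(times, input_end, forecast, multiples=(1, 2), strides=(1, 2)):
    """Enumerate subsets that all end at input_end but differ in start and sampling."""
    subsets = []
    for m in multiples:
        start = input_end - m * forecast     # duration is m times the expected range
        window = [t for t in times if start < t <= input_end]
        for stride in strides:
            # Sample backwards from the end so that every subset ends at input_end.
            sampled = window[::-1][::stride][::-1]
            subsets.append(sampled)
    return subsets

subsets = training_subsets(list(range(1, 13)), input_end=12, forecast=3)
```

Each resulting subset is smaller than the full training-input range, which is what makes training on it cheaper.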

In some embodiments, the training data may be down-sampled (e.g., temporally down-sampled or cross-sectionally down-sampled). The training data may be temporally down-sampled by selecting a down-sampled time interval and down-sampling each data set of the training data according to that interval (e.g., using the techniques described above). The training data may be cross-sectionally down-sampled by removing one or more data sets from the training data. In some embodiments, the training data are down-sampled both temporally and cross-sectionally.

In step 960, testing data are generated from the time series data. The testing data include a second subset of the observations of at least one of the data sets. The second subset of observations includes a test-input collection and a test-validation collection of observations. The times associated with the observations in the test-input and test-validation collections correspond to a test-input time range and a test-validation time range, respectively. The skip range separates the end of the test-input time range from the beginning of the test-validation time range. The duration of the test-validation time range is at least as long as the expected range.

In some embodiments, the duration of the test-input time range is determined based on the total number of observations in the time series data, the amount of variation over time in the values of at least one of the variables, the amount of periodic variation in the values of at least one of the variables, the consistency of the variation in the values of at least one of the variables over multiple time periods, and/or the duration of the expected range.

In some embodiments, subsets of the testing data are identified so that predictive models can be tested on the subsets rather than on all of the testing data. Testing a model on a subset of the testing data may use fewer computational resources (and less time) than testing the model on all of the testing data. In some embodiments, each subset of the testing data ends at the end time of the test-input time range. The testing data subsets may begin at different times within the test-input time range and/or may sample the observations in the test-input time range at different sampling rates. The duration of a testing data subset may be an integer multiple of the duration of the expected range.

In some embodiments, the testing data may be down-sampled (e.g., temporally down-sampled and/or cross-sectionally down-sampled). Some techniques for temporal and cross-sectional down-sampling are described above.

In step 970, a predictive model is fitted to the training data. In step 980, the fitted model is tested on the testing data. Cross-validation (including, but not limited to, nested cross-validation) and holdout techniques may be used to fit and/or test the predictive model. For cross-validation, the time series data may be partitioned cross-sectionally and/or temporally.
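The two partitioning directions mentioned for cross-validation might be sketched as follows. Contiguous temporal folds and round-robin assignment of whole series are assumptions chosen for simplicity; the disclosure does not prescribe a particular scheme, and the store identifiers are hypothetical.

```python
def temporal_folds(times, n_folds):
    """Partition sorted time stamps into contiguous, non-overlapping folds."""
    ordered = sorted(times)
    size, extra = divmod(len(ordered), n_folds)
    folds, start = [], 0
    for i in range(n_folds):
        end = start + size + (1 if i < extra else 0)   # spread the remainder
        folds.append(ordered[start:end])
        start = end
    return folds

def cross_sectional_folds(series_ids, n_folds):
    """Assign whole series (data sets) to folds in round-robin order."""
    folds = [[] for _ in range(n_folds)]
    for i, sid in enumerate(sorted(series_ids)):
        folds[i % n_folds].append(sid)
    return folds

t_folds = temporal_folds(range(10), 3)
s_folds = cross_sectional_folds(["store_c", "store_a", "store_b", "store_d"], 2)
```

Temporal folds keep each fold's observations adjacent in time, while cross-sectional folds keep every series intact within a single fold.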

Some embodiments of the method 300 for selecting a predictive model for a prediction problem may be used to select a time series predictive model for a time series prediction problem. In some embodiments, model-specific predictive values of one or more features of the time series data may be determined (e.g., using embodiments of the method 1000 described below). In some embodiments, time series predictive models may be blended (e.g., using the techniques described herein). In some embodiments, time series predictive models may be deployed and/or refreshed (e.g., using the techniques and/or performance enhancements described herein). In some embodiments, the strength of the interaction between two or more variables (e.g., features) of the time series data may be determined (e.g., using the techniques described herein).

Universal Feature Importance

When considering a supervised machine learning problem, an important task is often to measure which features have the greatest predictive value for the target. Measuring such feature importance can be useful at two distinct stages of predictive modeling: (1) understanding the prediction problem in general, and (2) understanding how a particular fitted model produces its predictions.

For purpose (1), a feature importance metric can inform the evaluation of the data set (e.g., step 408 of method 400), the presentation of the evaluation to the user (e.g., step 410 of method 400), how the user refines the data set (e.g., step 412 of method 400), and/or which modeling techniques are tried or suggested (e.g., steps 422 and 424). For purpose (2), a feature importance metric can inform the automated development of blended models (e.g., step 432 of method 400) and can help the user understand the relative performance of alternative models (e.g., step 446 of method 400).

In practice, determining feature importance for purpose (1) generally occurs within the context of a particular model or family of models (e.g., random forests). Each model has its own strengths and weaknesses as an instrument for measuring feature importance, so calculating feature importance with several different types of models can convey an improved understanding. For example, random forests, generalized additive models, and support vector machines are fundamentally different types of machine learning models and may produce different measures of the importance of the same features for the same data set. Such differences can provide deeper insight into the structure of the prediction problem and suggest avenues for further exploration. For purpose (2), feature importance is generally specific to the model(s) under consideration.

Some embodiments of techniques for calculating feature importance for any model are described below.

Given a data set (or any sample thereof) and a modeling technique, the exploration engine 110 can calculate the importance of any feature using universal partial dependence. First, the engine 110 obtains an accuracy metric for a predictive model fitted to the sample using the modeling technique. The engine 110 may perform this fit from scratch or use a previous fit. Then, for a given feature, the engine 110 takes all of the feature's values across all observations, shuffles them, and reassigns them to the observations (e.g., at random). This random shuffling may reduce (e.g., destroy) any predictive value of the feature. The engine may then rescore the model on the data set with the shuffled feature values, producing a new value of the accuracy metric. (Optionally, before rescoring, the engine may refit the model to the shuffled data set.) The decrease in the model's accuracy indicates how much predictive value was lost, and thus indicates the importance of the feature to the model and/or within the scope of the modeling technique.
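A minimal sketch of this shuffle-and-rescore procedure is given below. The toy model, the classification-accuracy metric, and the fixed random seed are assumptions made so that the example is self-contained; the optional refitting step is omitted.

```python
import random

def accuracy(model, rows, target):
    """Fraction of rows for which the model's prediction equals the target."""
    return sum(model(r) == r[target] for r in rows) / len(rows)

def shuffled_importance(model, rows, feature, target, seed=0):
    """Drop in accuracy after shuffling one feature's values across observations."""
    baseline = accuracy(model, rows, target)
    values = [r[feature] for r in rows]
    random.Random(seed).shuffle(values)               # reassign values at random
    shuffled = [{**r, feature: v} for r, v in zip(rows, values)]
    return baseline - accuracy(model, shuffled, target)

# Toy "fitted model" (an assumption for illustration): it predicts from x1 only.
def model(row):
    return int(row["x1"] > 0)

xs = [-3, -2, -1, 1, 2, 3, -4, 4]
rows = [{"x1": x, "x2": (i * 7) % 5, "y": int(x > 0)} for i, x in enumerate(xs)]
imp_x1 = shuffled_importance(model, rows, "x1", "y")
imp_x2 = shuffled_importance(model, rows, "x2", "y")
```

Because the toy model ignores `x2`, shuffling `x2` cannot change its accuracy, whereas shuffling `x1` typically lowers it. Averaging over several shuffles would give a more stable estimate.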

Using the technique described above for calculating the importance of one feature to one modeling technique and/or model, the engine can iterate over features to determine the relative importance of the features within a model and/or modeling technique, iterate over models and/or modeling techniques to determine the relative importance of a feature across models and/or modeling techniques, or both.

To meet purpose (1), the engine 110 may maintain a list of modeling techniques that generally produce illustrative results for feature importance. When evaluating a data set, the engine 110 may automatically run all or some of these modeling techniques. The engine may select the modeling techniques to run based on the properties of the data set. The user interface may display the feature importance values separately for each modeling technique used, or relative to one another across all of the modeling techniques used. The engine 110 may also allow the user to select any modeling technique from the modeling technique library 130 and use it to measure feature importance.

In addition to helping the user understand the prediction problem, the engine 110 may use the results of this application of feature importance to guide further automated analysis. For example, for features that score high in importance overall, the engine may allocate more resources to exploring the interactions of those features. Features that score low in importance overall may be dropped from the data set entirely. If some features have high importance for some models or modeling techniques and low importance for others, the engine may direct a deeper search for predictive models, trying more modeling techniques and using more of the data earlier in the model space exploration process.

To meet purpose (2), the engine may calculate feature importance values for fitted models automatically or on demand. In automatic operation, the engine 110 may calculate feature importance values for all of the models or for a portion of them. The portion may be chosen in any suitable manner. For example, the portion may include the N top-performing models, all models that meet at least a threshold level of performance on a particular performance metric, or all models whose performance on the metric is within a given fraction of the top model's performance. In some embodiments, the engine may take the available computational resources into account, making the portion larger or smaller in proportion to those resources.
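The three ways of choosing the portion of models could be expressed as in the sketch below. It assumes a higher-is-better performance metric; the function name, parameter names, and scores are hypothetical.

```python
def select_models(scores, top_n=None, min_score=None, within_fraction=None):
    """Pick the models whose feature importance will be computed.

    scores maps model name -> performance metric (higher is better).
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    if top_n is not None:                    # the N top-performing models
        return ranked[:top_n]
    if min_score is not None:                # models meeting a performance threshold
        return [m for m in ranked if scores[m] >= min_score]
    if within_fraction is not None:          # models close to the top model
        best = scores[ranked[0]]
        return [m for m in ranked if scores[m] >= best * (1 - within_fraction)]
    return ranked                            # default: all models

scores = {"random_forest": 0.91, "gam": 0.88, "svm": 0.80, "mlp": 0.86}
top_two = select_models(scores, top_n=2)
```

An engine that adapts to available resources could, for example, raise `top_n` or widen `within_fraction` when more compute is available.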

In some embodiments, a user may request feature importance calculations on demand. In some cases, the user may wish to see the importance of some or all of the features for a given model. In other cases, the user may wish to see the importance of some or all of the features across different models.

Using the techniques described above to provide feature importance values for a prediction problem in general or for particular models can have many benefits. For example:

(1) If a feature is not informative in general, or at least not in any of the accurate models, collection of the data corresponding to that feature may be discontinued. In some cases, making a feature available carries a real cost, such as the labor of extracting the data from a source location or even fees paid to a data provider.

(2) A discrepancy between the user's expectation of a feature's importance and its measured importance may warrant further investigation. It may turn out that there is an error in the data set that explains the discrepancy, or that the difference is real and provides new insight into the prediction problem.

(3) In some cases, it may be desirable to produce a model that makes predictions using as few features as possible. In such cases, the user may rerun a particular modeling technique, or rerun the search across all modeling techniques, using only the N most important features or only the features whose importance values exceed a particular threshold.

(4) Knowing which features are important can help the user improve predictive models by experimenting with different ways of transforming and combining the most important features.

Some Embodiments of Techniques for Determining the Predictive Value of One or More Features of a Data Set

FIG. 10 shows a method 1000 for determining the predictive value (e.g., "importance") of one or more features of an initial data set representing an initial prediction problem. In some embodiments, the predictive modeling system 100 may perform method 1000 (or portions thereof) during evaluation of a data set (e.g., step 408 of method 400) and/or when processing a data set (e.g., as described above with reference to method 300). In some embodiments, method 1000 may be used to determine the predictive value of any data set feature for any predictive model or predictive modeling technique.

In step 1010, the system 100 performs a plurality of predictive modeling procedures. Each predictive modeling procedure is associated with a predictive model. Performing each modeling procedure includes fitting the associated predictive model to at least a portion of the initial data set representing the initial prediction problem. The initial data set includes prior observations, and each observation generally includes values for at least some of the features of the initial data set.

In step 1020, the system 100 determines a first accuracy score for each of the fitted predictive models. The first accuracy score of a fitted model represents the accuracy with which the fitted model predicts one or more outcomes of the initial prediction problem. Any suitable metric and/or technique for determining the accuracy of a model may be used, including, without limitation, testing the model on a holdout portion of the initial data set.

In step 1030, the system 100 "shuffles" the values of a particular feature F across the observations in the initial data set, thereby generating a modified data set representing a modified prediction problem. In some embodiments, the shuffling is performed by reassigning the values of feature F from their original observations to different observations (e.g., at random). The shuffling operation may reduce (e.g., destroy) any predictive value of feature F. Other techniques for reducing (e.g., destroying) the predictive value of feature F in the data set are possible, including, without limitation, removing feature F from the data set or assigning the same value of feature F to every observation. In some embodiments, any technique that reduces the predictive value of feature F below a threshold may be used.

In step 1040, the system 100 determines a second accuracy score for each of the fitted predictive models on the modified prediction problem. The second accuracy score represents the accuracy with which the fitted model predicts one or more outcomes of the modified prediction problem. Any suitable metric and/or technique for determining the accuracy of a model may be used, including, without limitation, testing the model on a holdout portion of the modified data set. In some embodiments, the fitted model is refitted to the modified data set before the second accuracy score is determined.

In step 1050, for each of the predictive modeling procedures (or fitted models), the system 100 calculates the predictive value of feature F. In some embodiments, the predictive value of feature F for a modeling procedure or model is calculated based on the change in accuracy (e.g., based on the difference between the model's first and second accuracy scores). In some embodiments, the predictive value of the feature for a modeling procedure or model is determined from the first and second accuracy scores using a function under which the predictive value generally increases as the difference between the first and second accuracy scores increases. Because an individual predictive value may be specific to a particular modeling procedure or model, the predictive values determined in step 1050 may be referred to herein as "model-specific predictive values."
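The disclosure requires only that the predictive value generally increase with the difference between the two accuracy scores. One such function, chosen here purely as an assumption, is the accuracy drop expressed as a fraction of the first score and floored at zero.

```python
def predictive_value(first_score, second_score):
    """Map the accuracy drop to a non-negative value that grows with the drop."""
    if first_score <= 0:
        return 0.0                           # nothing to lose if the model had no accuracy
    drop = first_score - second_score
    return max(0.0, drop) / first_score      # relative drop, never negative

unchanged = predictive_value(0.9, 0.9)       # shuffling did not hurt the model
halved = predictive_value(0.8, 0.4)          # shuffling halved the accuracy
```

Flooring at zero treats a feature whose shuffling happens to improve the score as having no predictive value rather than a negative one; other monotone functions would serve equally well.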

In step 1060, the system 100 determines whether to analyze the predictive value of another feature. In some embodiments, this determination is made based on user input (e.g., the system continues to analyze the predictive values of features until all features specified by the user have been analyzed). In some embodiments, the system analyzes all of the features of the data set. In some embodiments, the system analyzes only a subset of the features in the data set. These features may be selected based on any suitable criteria. If the system determines in step 1060 that another feature remains to be analyzed, the system repeats steps 1030-1050 for that feature.

In some embodiments, the method 1000 includes an additional step (not shown) of selecting the predictive modeling procedures to be performed at step 1010. The system 100 may select the modeling procedures from, for example, the library 130 of predictive modeling techniques. In some embodiments, the system 100 selects two or more modeling procedures from two or more different predictive modeling families. Examples of predictive modeling families include linear regression techniques (e.g., generalized additive models), tree-based techniques (e.g., random forests), support vector machines, neural networks (e.g., multilayer perceptrons), and so on. For example, the system 100 may select one modeling procedure from the tree family (e.g., a random forest modeling procedure), another modeling procedure from the linear regression family (e.g., a generalized additive model), and a support vector machine modeling procedure.

The system 100 may use one or more of the model-specific predictive values (feature importance values) determined via the method 1000 to perform any suitable task (e.g., a predictive modeling task), including, without limitation, (1) processing a data set based on the model-specific predictive values, (2) presenting an evaluation of the data set to a user, (3) selecting predictive models for blending based on the model-specific predictive values, (4) presenting evaluated models and their associated model-specific predictive values to the user, (5) allocating resources, based on the model-specific predictive values, during the process of evaluating the space of potential predictive modeling solutions for a prediction problem, and (6) calculating a model-independent predictive value (feature importance value) of a feature based on the feature's model-specific predictive values. In some embodiments, after calculating model-independent feature importance values, the system 100 may use them to perform any suitable task (e.g., tasks (1)-(5) described in the preceding sentence).

In some embodiments, during evaluation of the data set (e.g., at step 408 of the method 400) and/or when processing the data set (e.g., as described above with reference to the method 300), the system 100 performs feature generation and/or feature engineering based on the model-specific predictive values as part of a predictive modeling procedure. For example, the system 100 may prune "less important" features from the data set. In this context, a feature may be classified as "less important" if its predictive value is below a threshold value, if it has one of the M lowest predictive values among the features in the data set, if it does not have one of the N highest predictive values among the features in the data set, and so on. As another example, the system may generate derived features from the "more important" features in the data set. In this context, a feature may be classified as "more important" if its predictive value is greater than a threshold value, if it has one of the N highest predictive values among the features in the data set, if it does not have one of the M lowest predictive values among the features in the data set, and so on. In some embodiments, the system 100 may use the method 1000 to calculate the predictive values of the derived features.
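The three "less important" criteria named above can be written as small set-valued helpers. This is a sketch under the assumption that predictive values are held in a plain dictionary keyed by feature name; the function names and the example values are hypothetical.

```python
from typing import Dict, Set


def below_threshold(values: Dict[str, float], threshold: float) -> Set[str]:
    """Features whose predictive value is below a threshold value."""
    return {f for f, v in values.items() if v < threshold}


def among_m_lowest(values: Dict[str, float], m: int) -> Set[str]:
    """Features holding one of the M lowest predictive values."""
    ranked = sorted(values, key=lambda f: values[f])
    return set(ranked[:m])


def not_among_n_highest(values: Dict[str, float], n: int) -> Set[str]:
    """Features that do not hold one of the N highest predictive values."""
    ranked = sorted(values, key=lambda f: values[f], reverse=True)
    return set(values) - set(ranked[:n])


def prune(values: Dict[str, float], less_important: Set[str]) -> Dict[str, float]:
    """Drop the features classified as less important."""
    return {f: v for f, v in values.items() if f not in less_important}
```

The "more important" classification is the mirror image: swap the comparison in `below_threshold`, or take the complement of either ranking-based set.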

In some embodiments, the system 100 may present (e.g., display) an evaluation of the data set to the user (e.g., at step 410 of the method 400), and the presented evaluation may include the predictive values of the features in the data set and/or information derived from them. For example, for one or more modeling procedures or models, the system 100 may (1) identify "more important" and/or "less important" features, (2) display the predictive values of the features, (3) rank the features by their predictive values, and/or (4) recommend that collection of less important features be discontinued and/or that less important features be removed from the data set. In response to presenting the evaluation of the data set to the user, the system 100 may receive user input specifying a refinement of the data set (e.g., at step 412 of the method 400).

In some embodiments, the system 100 selects predictive models for blending based on the model-specific predictive values and blends the selected models (e.g., at step 432 of the method 400). The system 100 may use any suitable technique to select predictive models for blending. For example, the system 100 may select "complementary top models" for blending. In this context, "complementary top models" may include accurate models that achieve their high accuracy through different mechanisms. The system 100 may classify a model as a "top" model if the model's accuracy is greater than a threshold accuracy, if the model has one of the N highest accuracy values among the fitted models, if the model does not have one of the M lowest accuracy values among the fitted models, and so on. The system may classify two models as "complementary" if (1) the most important features for the two models (e.g., the features with the highest predictive values for each model) differ, or (2) a feature that is important to the first model is unimportant to the second model, and a feature that is unimportant to the first model is important to the second model. In this context, a feature may be "important" to a model if it has a high predictive value for that model (e.g., the highest predictive value, one of the N highest predictive values, a predictive value greater than a threshold, etc.), and "unimportant" to a model if it has a low predictive value for that model (e.g., the lowest predictive value, one of the N lowest predictive values, a predictive value below a threshold, etc.). In some embodiments, the system 100 may use the classification techniques described above to select two or more complementary top models for blending.
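The "top" and "complementary" classifications can be combined into a small pair-selection routine. The sketch below assumes the N-highest-accuracy rule for "top" and a single threshold for "important"; the function names, model names, and numbers are hypothetical.

```python
from itertools import combinations
from typing import Dict, List, Set, Tuple


def top_models(accuracy: Dict[str, float], n: int) -> List[str]:
    """Models holding one of the N highest accuracy values."""
    return sorted(accuracy, key=lambda m: accuracy[m], reverse=True)[:n]


def important(values: Dict[str, float], threshold: float) -> Set[str]:
    """Features whose predictive value for one model exceeds the threshold."""
    return {f for f, v in values.items() if v > threshold}


def are_complementary(a: Dict[str, float], b: Dict[str, float],
                      threshold: float) -> bool:
    """Apply criteria (1) and (2) to two models' predictive values."""
    # Criterion (1): the single most important feature differs.
    if max(a, key=lambda f: a[f]) != max(b, key=lambda f: b[f]):
        return True
    # Criterion (2): each model relies on a feature the other ignores.
    imp_a, imp_b = important(a, threshold), important(b, threshold)
    return bool(imp_a - imp_b) and bool(imp_b - imp_a)


def complementary_top_pairs(accuracy: Dict[str, float],
                            values: Dict[str, Dict[str, float]],
                            n: int, threshold: float) -> List[Tuple[str, str]]:
    """Pairs of top models that qualify for blending."""
    tops = top_models(accuracy, n)
    return [(m1, m2) for m1, m2 in combinations(tops, 2)
            if are_complementary(values[m1], values[m2], threshold)]
```

Filtering on accuracy first keeps the pairwise comparison small, since only top models are candidates for blending.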

In some cases, blending complementary top models may yield a blended model that is much more accurate than its component models. By contrast, blending non-complementary models may not yield a blended model that is significantly more accurate than its component models.

In some embodiments, the system 100 may present the evaluated predictive models and their associated model-specific predictive values to the user (e.g., at step 446 of the method 400). In some embodiments, the system 100 calculates and/or displays feature importance values for only a subset of the predictive models (e.g., the top models). Presenting feature importance values to the user may help the user understand the relative performance of the evaluated models. For example, based on the presented feature importance values, the user (or the system 100) may identify a top model M that outperforms the other top models, and one or more features F that are important to model M but not to the other top models. The user may conclude (or the system 100 may indicate) that model M makes better use of the information represented by the features F than the other top models do. Based on this finding, a party with an interest in the outcome predicted by model M may invest in systems for measuring or controlling the features F, which may improve the outcome the model is predicting.

In some embodiments, during the process of evaluating the space of potential predictive modeling solutions for a prediction problem (e.g., while performing the method 300), the system 100 may allocate resources for the evaluation of modeling procedures based on the predictive values of the features in the data set representing the prediction problem. For example, the system 100 may select or suggest predictive modeling procedures to evaluate (e.g., at step 310 of the method 300 or at step 424 of the method 400). As described above, the system 100 may select or suggest predictive modeling procedures that are predicted to be suitable or highly suitable for the data set. Some techniques for determining the suitability of a modeling procedure for a data set having particular characteristics are described above. When determining the suitability of a predictive modeling procedure for a prediction problem based on the characteristics of that problem, the system 100 may treat the characteristics of the more important features of the data set as the characteristics of the prediction problem. In this way, the suitability scores generated by the system 100 can be tailored to the more important features of the data set. Because the system 100 allocates resources based on the suitability scores, tailoring the suitability scores to the more important features of the data set may result in resources being allocated to the evaluation of predictive modeling procedures based in part on feature importance.

Additionally or alternatively, the system 100 may allocate more resources to feature generation tasks that involve the more important features of the data set (e.g., during evaluation of the data set and/or when processing the data set, as described above with reference to the method 300 or step 408 of the method 400), and/or may allocate more resources to the blending of complementary top models (e.g., at step 432 of the method 400).

In some embodiments, the system 100 may calculate a model-independent predictive value of feature F based on the feature's model-specific predictive values. The system 100 may calculate the model-independent predictive value of a feature using any suitable technique, including, without limitation, (1) calculating a statistical measure of the model-specific predictive values (e.g., the mean, median, standard deviation, etc.), or (2) determining a combination of the model-specific predictive values. In the latter case, the combination may be a weighted combination, in which the model-specific predictive values for more accurate models are weighted more heavily than those for less accurate models. In some embodiments, the model-specific predictive values for the least accurate modeling procedures may be excluded from the calculations and/or combinations described in this paragraph.
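Both options can be illustrated in a few lines. The sketch assumes that each model's accuracy score is used directly as its weight, which is one possible weighting and not one the text prescribes; the function names and the optional exclusion count are hypothetical.

```python
from statistics import mean, median
from typing import Dict


def summary_statistics(values: Dict[str, float]) -> Dict[str, float]:
    """Option (1): statistical measures over the model-specific values."""
    return {"mean": mean(values.values()), "median": median(values.values())}


def weighted_predictive_value(values: Dict[str, float],
                              accuracy: Dict[str, float],
                              drop_least_accurate: int = 0) -> float:
    """Option (2): accuracy-weighted combination for one feature.

    values[m] is the feature's predictive value for model m and
    accuracy[m] is that model's accuracy score (higher is better).
    """
    ranked = sorted(values, key=lambda m: accuracy[m], reverse=True)
    # Optionally exclude the least accurate models from the combination.
    kept = ranked[: len(ranked) - drop_least_accurate]
    if not kept:
        raise ValueError("no models left to combine")
    total_weight = sum(accuracy[m] for m in kept)
    return sum(accuracy[m] * values[m] for m in kept) / total_weight
```

Weighting by accuracy lets the judgment of the models that predict best dominate the model-independent value, while dropping the weakest models removes their influence entirely.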

Second-Order Models

As described above (e.g., in the section titled "Model Deployment Engine"), it may be desirable to use a suitable second-order modeling technique (e.g., RuleFit) to generate a model of a model (a "second-order model"), in order to simplify the task of generating predictive code and to avoid disclosing proprietary details of the modeling techniques used to build the model. In addition, people can often interpret a second-order model more easily than the original model; a second-order model can provide insight into what would otherwise be a "black box" model.

However, there is a concern that second-order models may be systematically less accurate at prediction than their source models. Some embodiments of techniques for reducing (e.g., minimizing) any loss of accuracy associated with moving from a source model to a second-order model (and, in some cases, for generating second-order models that are more accurate than their source models) are described below.

In some embodiments, after the user has been presented with the fitted models (e.g., at step 446 of the method 400) and/or with the results of holdout testing for the top models (e.g., at step 450 of the method 400), the user may wish to (a) interpret a model to understand how it uses the features to generate accurate predictions, and/or (b) deploy the model using the deployment engine 140. Certain modeling techniques may use opaque and/or complex methods to build models, which may, in at least some cases, make the models difficult to understand and increase the difficulty of generating predictive code. The need to avoid disclosing proprietary model-building techniques can further exacerbate the challenges of making models understandable and generating predictive code for them.

In some embodiments, the predictive modeling system 100 addresses these problems by building a second-order model of the source model. In some embodiments, the predictive modeling system 100 builds the second-order model using one or more modeling techniques (e.g., RuleFit, generalized additive models, etc.) that generally produce models that are relatively easy to interpret and for which predictive code can be generated relatively easily. Such techniques are referred to herein as "second-order modeling techniques." After the engine 110 generates a fitted "first-order" model (e.g., a first-order model that the user wants to understand better, a first-order model for which the user wants to generate predictive code, and/or a first-order model with proprietary features), the system 100 may use a second-order modeling technique to generate a second-order model of the first-order model.

For each feature in the first-order model, there is a corresponding set of feature values taken from, or derived from, the original data set. The second-order modeling technique may use the same features as the first-order model, and may therefore use the original values of those features, or a subset of them, as its training and testing data. In some embodiments, instead of using the actual values of the target from the data set, the second-order modeling technique uses the values of the target predicted by the first-order model.
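The key point of this arrangement, that the second-order model is trained on the first-order model's predictions and not on the observed target, can be shown with a deliberately tiny sketch. A one-variable least-squares line stands in for an interpretable technique such as RuleFit or a generalized additive model; that substitution and all names below are assumptions made for illustration.

```python
from typing import Callable, Sequence, Tuple


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def fit_second_order(first_order: Callable[[float], float],
                     xs: Sequence[float]) -> Tuple[float, float]:
    """Fit the interpretable stand-in to the first-order model's output."""
    # The training target is what the first-order model predicts for the
    # original feature values, not the target observed in the data set.
    predicted = [first_order(x) for x in xs]
    return fit_line(xs, predicted)
```

Because the observed target is never consulted, the same routine also works on feature values for which no target was recorded.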

In some cases, the second-order modeling technique may use alternative or supplemental training and/or testing data. Such alternatives may include other real-world data from the same or different data sources, real-world data combined with machine-generated data (e.g., data generated via interpolation and extrapolation, for the purpose of covering a wider range of possibilities than is present in the real-world samples), or data generated entirely by a machine-based probabilistic model. In some embodiments, the values of the target variable used to train the second-order model are the values predicted by the first-order model.

One consideration is that any errors in the first-order model may be compounded or magnified when the second-order model is built, thereby systematically reducing the accuracy of the second-order model. First, the inventors have recognized and appreciated that whether this is true is an empirical question. Second, the inventors have recognized and appreciated that, in the cases where it is true, using a more accurate first-order model is likely to reduce the loss of accuracy. For example, because blended models are sometimes more accurate than any single model (e.g., as described at the end of the section titled "Modeling Space Search Engine"), fitting a second-order model to a blend of first-order models may reduce any loss of accuracy associated with second-order modeling.

The inventors have determined empirically that the concern about the accuracy of second-order models is largely unfounded. The system was tested on 381 data sets, yielding 1195 classification and 1849 regression first-order models. For the classification models tested, 43% of the second-order models were less accurate than the corresponding first-order models, but no more than 10% worse according to the log loss measure of accuracy. 30% of the second-order models were actually more accurate than the first-order models. In only 27% of the cases was the second-order model more than 10% less accurate than the first-order model. Approximately one-third of these cases (about 9% of the total population) occurred when the data set was very small. Another one-third of these cases (again about 9% of the total population) occurred when the first-order model was very accurate, at less than 0.1 log loss, and the second-order model was also still very accurate, at less than 0.1 log loss. Thus, in more than 90% of the cases where the data set was sufficiently large, the second-order model was either within 10% of the first-order model or very accurate by absolute standards. In 41% of the cases, the best second-order model was derived from a blend of first-order models.
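The "more than 90%" figure for classification models can be reproduced from the percentages above under one reading of the text. The sketch assumes that the small-data-set cases are exactly the roughly 9% counted among the 27% and that the remaining cases all had sufficiently large data sets; the source does not spell this out, so the calculation is illustrative only.

```python
def share_acceptable_on_large_data() -> float:
    """Share of sufficiently-large-data cases with an acceptable second-order model."""
    within_10_pct = 43      # less accurate, but by no more than 10%
    more_accurate = 30      # second-order model beat the first-order model
    over_10_pct_worse = 27  # more than 10% less accurate
    assert within_10_pct + more_accurate + over_10_pct_worse == 100

    small_data_set = 9      # about one-third of the 27% (assumed excluded)
    both_very_accurate = 9  # another third: both models under 0.1 log loss

    acceptable = within_10_pct + more_accurate + both_very_accurate  # 82
    large_enough = 100 - small_data_set                              # 91
    return acceptable / large_enough
```

Under these assumptions the share is 82/91, slightly above 0.90, which matches the statement in the text.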

For the regression models tested, 39% of the second-order models were less accurate than the corresponding first-order models, but no more than 10% worse according to the residual mean squared error measure of accuracy. 47% of the second-order models were actually more accurate than the first-order models. In only 14% of the cases was the second-order model more than 10% less accurate than the first-order model. About 10% of these cases (about 1.5% of the total population) occurred when the data set was very small. In 35% of all cases, the best second-order model was derived from a blend of first-order models.

Based on these empirical data, the inventors have recognized and appreciated that second-order models are generally either relatively accurate or accurate in absolute terms when the original data set is suitably large. In many cases, second-order models are actually more accurate than first-order models. Finally, in more than one-third of all the classification and regression problems tested, the most accurate second-order model was derived from a blend of first-order models.

The inventors have recognized and appreciated that second-order models can be beneficial for (a) understanding complex first-order models, (b) simplifying the task of generating predictive code for a predictive model, and (c) protecting proprietary model-building techniques. Blended models often produce the best predictive results, but they are also generally more complex, because they combine the complexity of every component of every model included in the blend. Likewise, generating predictive code for a blended model generally combines the challenges of generating predictive code for all of the component models included in the blend. On top of these component challenges, blended models are generally slower to produce predictions (and/or require more computational resources to produce them), because all of the blended models are generally computed to produce each prediction. A second-order model generally reduces this time (and/or use of computational resources) to that of computing a single model. In addition, a blended model includes the proprietary features of each of its component models. Thus, second-order models and blended models operate in a highly complementary manner.

In addition, because the system 100 can automatically separate a data set into training, testing, and holdout partitions, the system can readily determine whether any particular second-order model performs adequately in comparison to the first-order model.

Some Embodiments of Second-Order Modeling Techniques

Referring to FIG. 11A, a method 1100 for generating a second-order predictive model may include steps 1110 and 1120. At step 1110, a fitted first-order predictive model is obtained. The first-order predictive model is configured to predict the values of one or more output variables of a prediction problem based on the values of one or more first input variables. The first-order model may be obtained using any suitable technique (e.g., executing a machine-executable module that implements a predictive modeling procedure, performing an embodiment of the predictive modeling method 300, blending two or more predictive models, etc.) and may be any suitable type of predictive model (e.g., a time-series model, etc.). At step 1120, a second-order predictive modeling procedure is performed on the fitted first-order model. The second-order modeling procedure is associated with a second-order predictive model (e.g., a RuleFit model, a generalized additive model, any model organized as a set of conditional rules, a blend of two or more of the foregoing types of models, etc.).

Referring to FIG. 11B, a method 1120 for performing a second-order predictive modeling procedure may include steps 1122, 1124, 1126, and 1128. At step 1122, second-order input data are generated. The second-order input data include a plurality of second-order observations. Each second-order observation includes observed values of one or more second input variables and the values of the output variables predicted by the first-order model based on the values of the first input variables that correspond to those values of the second input variables. Generating the second-order input data may include, for each second-order observation: obtaining the observed values of the second input variables and the corresponding observed values of the first input variables, and applying the first-order predictive model to the corresponding observed values of the first input variables to generate the predicted values of the output variables.
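Step 1122 amounts to pairing each observation's second-input values with the first-order model's prediction for the matching first-input values. The following sketch uses tuples for input vectors and a plain callable for the first-order model; both choices and all names are assumptions.

```python
from typing import Callable, List, Sequence, Tuple

Inputs = Tuple[float, ...]


def make_second_order_data(
    first_order: Callable[[Inputs], float],
    first_inputs: Sequence[Inputs],
    second_inputs: Sequence[Inputs],
) -> List[Tuple[Inputs, float]]:
    """Build (second-input values, predicted output) observations.

    first_inputs[i] and second_inputs[i] describe the same observation;
    the two sequences may hold the same variables or different ones.
    """
    if len(first_inputs) != len(second_inputs):
        raise ValueError("input sequences must be aligned")
    return [(x2, first_order(x1))
            for x1, x2 in zip(first_inputs, second_inputs)]
```

Keeping the two input sequences separate covers the case where the second-order model uses a different set of input variables from the first-order model.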

The first-order model and the second-order model may use the same set of input variables. Alternatively, the second-order model may use one or more (e.g., all) of the first-order model's input variables and may also use one or more other input variables. Alternatively, the second-order model may use none of the first-order model's input variables and may instead use one or more other input variables.

At step 1124, second-order training data and second-order testing data are generated from the second-order input data. Some embodiments of techniques for generating training data and testing data from an input data set are described herein.
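A simple way to realize step 1124 is a seeded random partition, sketched below. The single random split, the default test fraction, and the function name are assumptions; the partitioning techniques described elsewhere in the document could be substituted.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_second_order_data(observations: Sequence[T],
                            test_fraction: float = 0.25,
                            seed: int = 0) -> Tuple[List[T], List[T]]:
    """Partition second-order observations into (training, testing) lists."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(observations)
    # A fixed seed makes the partition reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]
```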

At step 1126, a fitted second-order predictive model of the fitted first-order model is generated by fitting the second-order predictive model to the second-order training data. At step 1128, the fitted second-order predictive model of the fitted first-order model is tested on the second-order testing data. Cross-validation (including, without limitation, nested cross-validation) and holdout techniques may be used to fit and/or test the predictive models.

In some embodiments, the fitted first-order and second-order models may be compared. For example, an accuracy score may be determined for each fitted model. The accuracy score of each fitted model may indicate the accuracy with which the fitted model predicts the outcomes of one or more prediction problems. (In some embodiments, the accuracy score of the second-order model indicates the accuracy with which the fitted model predicts the observed values of the target variable, not the accuracy with which it predicts the target variable values predicted by the first-order model. In other embodiments, the accuracy score of the second-order model indicates the accuracy with which the fitted model predicts the target variable values predicted by the first-order model.) In some cases, the accuracy score of the fitted second-order model exceeds the accuracy score of the fitted first-order model. As another example, the amounts of computational resources used by each fitted predictive model to predict the outcomes of one or more prediction problems may be compared. In some embodiments, the amount of computational resources used by the fitted second-order model is less than the amount used by the fitted first-order model.
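The two ways of scoring a second-order model, against observed target values or against the first-order model's predictions, differ only in the reference series. The sketch below uses root mean squared error as the measure, so lower values mean higher accuracy; the choice of measure and all names are assumptions.

```python
from math import sqrt
from typing import Callable, Dict, Sequence


def rmse(predicted: Sequence[float], reference: Sequence[float]) -> float:
    """Root mean squared error between two aligned series."""
    squared = [(p - r) ** 2 for p, r in zip(predicted, reference)]
    return sqrt(sum(squared) / len(squared))


def compare_models(first_order: Callable[[float], float],
                   second_order: Callable[[float], float],
                   inputs: Sequence[float],
                   observed: Sequence[float]) -> Dict[str, float]:
    """Score both fitted models on the same test observations."""
    p1 = [first_order(x) for x in inputs]
    p2 = [second_order(x) for x in inputs]
    return {
        # Error of each model against the observed target values.
        "first_vs_observed": rmse(p1, observed),
        "second_vs_observed": rmse(p2, observed),
        # Fidelity: how closely the second-order model tracks the first.
        "second_vs_first": rmse(p2, p1),
    }
```

Comparing `second_vs_observed` with `first_vs_observed` answers whether the second-order model lost (or gained) accuracy, while `second_vs_first` measures how faithfully it reproduces the first-order model.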

In some embodiments, a plurality of second-order modeling procedures are performed, and a plurality of second-order models of the first-order model are generated. In some embodiments, the accuracy scores and/or computational resource usage of the second-order models are determined and compared. One or more of the second-order models may be selected for deployment based at least in part on the models' accuracy scores and/or computational resource usage.
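
The selection described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the selection rule of any particular embodiment: the `Candidate` fields, the cost unit, and the rule "most accurate model within a cost budget" are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str        # hypothetical label for a fitted second-order model
    accuracy: float  # accuracy score, higher is better
    cost: float      # assumed unit: CPU-seconds per 1,000 predictions


def select_for_deployment(candidates, max_cost):
    """Return the most accurate candidate whose cost fits the budget, or None."""
    affordable = [c for c in candidates if c.cost <= max_cost]
    if not affordable:
        return None
    # Prefer higher accuracy; break ties in favor of lower cost.
    return max(affordable, key=lambda c: (c.accuracy, -c.cost))


candidates = [
    Candidate("second_order_a", accuracy=0.91, cost=4.0),
    Candidate("second_order_b", accuracy=0.93, cost=12.0),
    Candidate("second_order_c", accuracy=0.91, cost=2.5),
]
chosen = select_for_deployment(candidates, max_cost=5.0)
```

With a budget of 5.0, the more accurate but more expensive candidate is excluded, and the tie between the two remaining candidates is broken by cost.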

Some embodiments of the method 300 for selecting a predictive model for a prediction problem may be used to select a second-order predictive model for the prediction problem. In some embodiments, the model-specific predictive value of one or more features of the input data may be determined for a second-order model (e.g., using an embodiment of method 1000 described below). In some embodiments, second-order predictive models may be blended (e.g., using the techniques described herein). In some embodiments, second-order predictive models may be deployed and/or refreshed (e.g., using the techniques and/or performance enhancements described herein).

Text Language Conditioning

Some features of a data set may contain blocks of unstructured text. Because different languages have different characteristics, the best predictive modeling steps may depend on which language or languages are present in an unstructured text feature.

In some embodiments, system 100 detects the language of a text string or, more generally, calculates the likelihood that the string is in a given language. System 100 may incorporate knowledge of the text's language into the predictive modeling process through conditional text processing. For example, a modeling technique may include a pre-processing step that extracts structured features from text, and the feature extraction procedure may differ depending on the language. As another example, some modeling techniques may be specific to a particular language or may perform differently on different languages.

For example, character n-grams are effective on Chinese text. They often produce good results on English text as well, but less consistently, so word n-grams are typically preferred for processing English text. There are also "synthetic" languages, such as German and Finnish, in which a single word may be formed from many morphemes. Because word n-grams can lose much of the information in this structure, processing text in these languages generally calls for additional specialized tokenization that extracts morphemes from words. In addition, text in different languages may benefit from different treatment for stop-word removal, stemming, and lemmatization.

In some embodiments, an individual modeling technique may have conditional logic for text processing that depends on the detected language. For example, a modeling technique may use word n-grams for European languages and character n-grams for Chinese. Engine 110 may pass the detected language or languages to a modeling technique when dispatching it for execution.
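
The conditional logic described above can be sketched as follows. This is a minimal illustration under stated assumptions: the language codes, the set `CHARACTER_NGRAM_LANGUAGES`, and whitespace tokenization for word n-grams are hypothetical simplifications, and language detection itself is assumed to have happened upstream.

```python
# Assumed set of language codes for which character n-grams are used.
CHARACTER_NGRAM_LANGUAGES = {"zh"}


def ngrams(units, n):
    """Return all contiguous n-grams of a sequence as tuples."""
    return [tuple(units[i:i + n]) for i in range(len(units) - n + 1)]


def extract_ngrams(text, language, n=2):
    """Choose character or word n-grams based on the detected language."""
    if language in CHARACTER_NGRAM_LANGUAGES:
        # Character n-grams: every non-whitespace character is a unit.
        units = [ch for ch in text if not ch.isspace()]
    else:
        # Word n-grams: naive lowercase whitespace tokenization.
        units = text.lower().split()
    return ngrams(units, n)


english = extract_ngrams("The quick brown fox", "en")
chinese = extract_ngrams("机器学习", "zh")
```

A synthetic language such as German or Finnish would need a third branch with morpheme-level tokenization, which is omitted here.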

In some embodiments, engine 110 may exclude a modeling technique or a family of modeling techniques from evaluation based on language, or may change its resource allocation (e.g., processing priority) based on language. In some embodiments, a modeling technique may have associated metadata specifying the language or languages for which the technique is appropriate. In some embodiments, a modeling technique may have associated metadata indicating how well the technique is expected to perform on a set of languages. Such performance estimates may be based on prior information about the modeling technique, or may be computed through meta-machine learning based on how well different modeling techniques perform on different languages.

Interaction Strength

In a predictive model, two or more predictor variables (e.g., features) taken together may influence the target variable in addition to, or instead of, the influence of those variables taken individually. In some embodiments, the system determines the value of a metric or set of metrics that indicates such interactions across a wide variety of prediction problems. In some embodiments, engine 110 incorporates the metric into a process that automatically applies appropriate interaction modeling techniques. In some embodiments, engine 110 allows the user to choose among alternative pre-packaged interaction modeling techniques or to apply the user's own custom technique.

For example, in the context of automobile insurance, there may be an interaction between the "age" and "gender" features. At some ages, women are safer drivers than men. At other ages, men and women are equally safe. At still other ages, men are safer. As another example, in the context of geographic analysis, there may be an interaction between latitude and longitude. Using only the latitude and longitude features without their interaction, a model might predict that Singapore has a low GDP per capita rather than its very high actual GDP per capita, whereas with the interaction information the model can correctly predict that Singapore has a high GDP per capita.

In general, detecting interactions involves performing a method that detects statistically significant interactions between two or more features. In some embodiments, engine 110 may detect interactions when evaluating the data set (e.g., at step 408 of method 400). In some embodiments, individual modeling techniques may be configured (e.g., by modeling technique builder 220) to test for interactions individually. The former approach may provide a general solution, while the latter may provide more flexibility in handling specific situations.

In the general case, engine 110 may determine (a) which subset of features are candidates for interactions, (b) which orders of interaction to test for, and (c) which specific combinations of features to test. For each determination, engine 110 may use either heuristics or the results of meta-machine learning on past prediction problems. For example, some research suggests that testing for 3-way or higher-order interactions yields little improvement in predictive accuracy. Thus, in some embodiments, engine 110 may test only for 2-way interactions, or may default to testing only for 2-way interactions unless explicitly overridden by the user or unless a meta-machine learning algorithm indicates that a particular problem has a significant probability of being an exceptional case in which modeling higher-order interactions would improve accuracy.

After determining the set of potential interactions, engine 110 may select an interaction detection method. In some cases, engine 110 may use the same interaction detection method for all possible interactions. In other cases, engine 110 may select different interaction detection methods for different combinations of features based on the characteristics of those features. Some examples of interaction detection methods include ANOVA, Partial Dependence Functions (J.H. Friedman and B.E. Popescu, Predictive Learning via Rule Ensembles, 2005), GUIDE (W. Loh, Regression Trees with Unbiased Variable Selection and Interaction Detection, Statistica Sinica 12(2), 2002), Additive Groves (D. Sorokina et al., Detecting Statistical Interactions with Additive Groves of Trees, ICML, 2008), and FAST (Yin Lou et al., Accurate Intelligible Models with Pairwise Interactions, KDD, 2013).

In some embodiments, system 100 performs these interaction detection techniques using the same mechanisms described above for executing other machine learning techniques. After engine 110 obtains the interaction detection results, it may (1) construct new features based on the significant interactions and add them to the data set, and/or (2) if a modeling technique is able to handle interactions directly, pass the list of significant interactions to the modeling technique as a parameter. Execution of the modeling technique may then continue in the manner described above.
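
Option (1) above, constructing new features from significant interactions, can be sketched as follows. This is a minimal illustration under stated assumptions: the product of two numeric features is only one common way to encode a 2-way interaction, and the function name and the `a_x_b` column naming are hypothetical.

```python
def add_interaction_features(rows, interactions):
    """Return copies of rows with one product feature per (a, b) pair.

    rows: list of dicts mapping feature name -> numeric value.
    interactions: list of (feature_a, feature_b) pairs, assumed to come
    from an upstream interaction detection step.
    """
    augmented = []
    for row in rows:
        new_row = dict(row)  # leave the input row untouched
        for a, b in interactions:
            new_row[f"{a}_x_{b}"] = row[a] * row[b]
        augmented.append(new_row)
    return augmented


rows = [{"latitude": 1.5, "longitude": 100.0}]
augmented = add_interaction_features(rows, [("latitude", "longitude")])
```

Option (2) would instead hand the list of pairs to a modeling technique that supports interaction terms natively, leaving the data set unchanged.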

Prediction Performance Enhancements

A user may use a deployment engine (e.g., model deployment engine 140) to deploy a fitted model to a potentially independent prediction service. In some cases, computing a prediction target from observed feature values using a particular model may require significant computational resources. In some embodiments, engine 110 is configured to provide timely responses while limiting (e.g., minimizing) resource consumption, particularly when the prediction service handles requests for many models and/or frequent requests for one or more models.

An example follows. In online gaming, a game provider may support many different types of games, with many instances of each type of game and many users playing in each instance. To increase (e.g., optimize) user satisfaction and revenue from its games, such a provider may wish to predict user behavior based on the outcomes of the games being played. The provider may use these predictions to decide what to offer players or to adjust future game experiences. Such a provider may therefore rely on tens to hundreds of different models, each of which may be used to make, for example, thousands to millions of predictions per day. Moreover, the relative prediction demand across models may change significantly over time due to demographic effects, changes in popularity, differences in average game completion times, and so on.

To make operational predictions, a user may want a standalone prediction service that is completely separate from the model-building computing infrastructure. The standalone prediction service may run in a different computing environment or may be managed as a separate component within a shared computing environment. Once instantiated, the service's execution, security, and monitoring are fully separated from the model-building environment, allowing the user to deploy and manage it independently.

After the service is instantiated, the deployment engine may allow the user to install fitted models into the service. With respect to improving (e.g., optimizing) performance, an implementation of a modeling technique that is appropriate for model fitting may be suboptimal for prediction. For example, fitting a model requires executing the same algorithm repeatedly, so it is often worthwhile to invest a significant amount of overhead in enabling fast parallel execution of the algorithm. However, unless the expected rate of prediction requests is very high, the same overhead may not be worthwhile for a standalone prediction service. In some cases, the developer of a modeling technique may provide specialized versions of one or more of its component execution tasks that offer better performance characteristics in a prediction environment. In particular, implementations designed for highly parallel execution or for execution on specialized processors may benefit prediction performance. Likewise, when a modeling technique includes tasks specified in a programming language, pre-compiling those tasks when the service is instantiated, rather than waiting until service startup or the first prediction request for that model, may improve performance.

In addition, model fitting tasks generally use a different computing infrastructure than prediction services. To protect the cloud infrastructure from errors during execution of a modeling technique, and to prevent other users of the cloud from accessing the modeling technique, modeling techniques may run in secure computing containers during model fitting. Prediction services, however, often run on dedicated machines or clusters. Removing the secure container layer may therefore reduce overhead without any practical drawback.

Accordingly, the deployment engine may use a set of rules for packaging and deploying a model, based on the specific tasks executed by the model's modeling technique, the expected load, and the characteristics of the target computing environment for prediction. These rules may optimize execution.

Because a given prediction service may run multiple models, the service may allocate computing resources across the prediction requests for each model. There are two basic cases: deployment on one or more server machines, and deployment on a computing cluster.

In the case of deployment on servers, the challenge is how to allocate requests among multiple servers. The prediction service may have several types of prior information. This information may include (a) an estimate of how long it takes to execute a prediction for each configured model, (b) the expected frequency of requests for each configured model at different times, and (c) the desired priority of model execution. The execution time estimate may be computed by measuring the actual execution speed of each model's prediction code under one or more conditions. The desired priority of model execution may be specified by the service administrator. The expected frequency of requests may be computed from historical data for the model, forecast by a meta-machine learning model, or provided by the administrator.

The service may include an objective function that combines some or all of these factors to compute the fraction of the aggregate computing power of all available servers to be initially allocated to each model. As the service receives and executes requests, it naturally obtains updated information about execution times and the expected frequency of requests. The service may therefore recompute these fractions and reallocate models to servers accordingly.
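
One possible form of such an objective function is sketched below. This is a minimal illustration under stated assumptions, not the function used by any particular embodiment: the source does not specify how the factors are combined, so the product "execution time × request rate × priority" and all names here are hypothetical.

```python
def allocate_capacity(models):
    """Return each model's share of the aggregate computing power.

    models: dict mapping model name to a tuple of
      (seconds_per_prediction, requests_per_second, priority).
    Each share is proportional to the product of the three factors,
    which is an assumed way of combining them.
    """
    demand = {name: t * f * p for name, (t, f, p) in models.items()}
    total = sum(demand.values())
    if total == 0:
        # No measurable demand yet: split capacity evenly.
        return {name: 1.0 / len(models) for name in models}
    return {name: d / total for name, d in demand.items()}


shares = allocate_capacity({
    "churn": (0.002, 100.0, 1.0),
    "fraud": (0.004, 50.0, 3.0),
})
```

Re-running the function with updated time and frequency measurements yields the recomputed fractions that the service would use to reallocate models.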

A deployed prediction service may have two different types of server processes: routers and workers. One or more routers may form a routing service that accepts requests for predictions and assigns them to workers. An incoming request may include a model identifier indicating which predictive model to use, a user or client identifier indicating which user or software system is making the request, and one or more vectors of predictor variables for that model.

When a request arrives at a dedicated prediction service, its routing service may examine some combination of the model identifier, the user or client identifier, and the number of vectors of predictor variables. The routing service may then assign the request to a worker so as to increase (e.g., maximize) server cache hits for the instructions and data used (1) in executing the given model and/or (2) for the given user or client. The routing service may also take the number of vectors of predictor variables into account in order to achieve a mix of batch sizes submitted to each worker that balances latency and throughput.

Examples of algorithms for allocating requests for models across workers include round-robin, weighted round-robin based on the computational intensity of the models and/or the computing power of the workers, and dynamic allocation based on reported load. To facilitate fast routing of requests to a designated server, the routing service may use a hash function that selects the same server for a given identical set of observed characteristics (e.g., the model identifier). The hash function may be a simple hash function or a consistent hash function. A consistent hash function requires less overhead when the number of nodes (here, workers) changes. Thus, when a worker goes down or a new worker is added, a consistent hash function can reduce the number of hash keys that must be recomputed.
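
The consistent hashing behavior described above can be sketched as follows. This is a minimal illustration under stated assumptions: the class name, the use of SHA-256, and the number of virtual replicas per worker are hypothetical choices, and worker names are assumed to be strings.

```python
import bisect
import hashlib


def _hash(key):
    """Map a string to a large integer position on the hash ring."""
    return int(hashlib.sha256(key.encode("utf-8")).hexdigest(), 16)


class ConsistentHashRouter:
    """Routes model identifiers to workers via a hash ring."""

    def __init__(self, workers, replicas=100):
        self.replicas = replicas  # virtual nodes per worker (assumed value)
        self._ring = []           # sorted list of (position, worker)
        for worker in workers:
            self.add_worker(worker)

    def add_worker(self, worker):
        for i in range(self.replicas):
            bisect.insort(self._ring, (_hash(f"{worker}#{i}"), worker))

    def remove_worker(self, worker):
        self._ring = [(h, w) for h, w in self._ring if w != worker]

    def route(self, model_id):
        if not self._ring:
            raise RuntimeError("no workers available")
        # First ring entry at or after the key's position, wrapping around.
        index = bisect.bisect_left(self._ring, (_hash(model_id), ""))
        if index == len(self._ring):
            index = 0
        return self._ring[index][1]


router = ConsistentHashRouter(["worker-a", "worker-b", "worker-c"])
keys = [f"model-{i}" for i in range(200)]
before = {k: router.route(k) for k in keys}
router.remove_worker("worker-b")
after = {k: router.route(k) for k in keys}
```

Only the keys that were routed to the removed worker are reassigned; every other key keeps its worker, which preserves that worker's cached instructions and data.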

In addition to improving (e.g., optimizing) performance by intelligently distributing prediction requests among the available servers, the prediction service may improve (e.g., optimize) the performance of individual models by intelligently configuring how each worker executes each model. For example, if a given server receives a mix of requests for several different models, loading and unloading a model for each request may incur significant overhead. However, aggregating requests for batch processing may introduce significant latency. In some embodiments, if the administrator specifies a latency tolerance for a model, the service can make this trade-off intelligently. For example, an urgent request may have a latency tolerance of only 100 milliseconds, in which case the server may process only one or at most a few requests at a time. In contrast, a latency tolerance of 2 seconds may allow batch sizes in the hundreds. Because of the overhead, doubling the latency tolerance may increase throughput by 10x to 100x.
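
The trade-off between latency tolerance and batch size can be illustrated with a simple cost model. This is a sketch under stated assumptions: it assumes a batch of n requests takes a fixed overhead plus n times a per-request cost, and the figures of 80 ms overhead and 5 ms per request are invented for illustration only.

```python
def max_batch_size(tolerance_ms, overhead_ms, per_item_ms):
    """Largest batch whose total time fits within the latency tolerance.

    Assumes batch time = overhead_ms + n * per_item_ms. Returns at least 1,
    because a request must still be served even if the tolerance is missed.
    """
    budget = tolerance_ms - overhead_ms
    return max(1, budget // per_item_ms)


def throughput_per_second(batch_size, overhead_ms, per_item_ms):
    """Predictions per second when batches of this size run back to back."""
    return 1000.0 * batch_size / (overhead_ms + batch_size * per_item_ms)


OVERHEAD_MS = 80   # assumed fixed cost per batch (e.g., model loading)
PER_ITEM_MS = 5    # assumed marginal cost per request

urgent_batch = max_batch_size(100, OVERHEAD_MS, PER_ITEM_MS)
relaxed_batch = max_batch_size(2000, OVERHEAD_MS, PER_ITEM_MS)
urgent_rate = throughput_per_second(urgent_batch, OVERHEAD_MS, PER_ITEM_MS)
relaxed_rate = throughput_per_second(relaxed_batch, OVERHEAD_MS, PER_ITEM_MS)
```

Under these assumed costs, a 100 ms tolerance allows only a few requests per batch while a 2 s tolerance allows several hundred, because the fixed overhead is amortized over far more requests. The actual gain depends on the real overhead and per-request cost.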

Likewise, using operating system threads may improve throughput while increasing latency, due to thread setup and initialization overhead. In some cases, predictions may be extremely latency-sensitive. If all requests for a given model are likely to be latency-sensitive, the service may configure the servers that handle those requests to operate in single-threaded mode. Alternatively, if only a subset of requests are likely to be latency-sensitive, the service may allow the requester to flag a given request as sensitive. In that case, the server may operate in single-threaded mode only while servicing that particular request.

In some cases, the user's organization may have batches of predictions that it wants to compute as quickly as possible using a distributed computing cluster. Distributed computing frameworks (e.g., Apache Spark) generally allow an organization to set up a cluster that runs the framework, after which any program designed to work with the framework can submit jobs consisting of data and executable instructions.

Because executing one prediction on a model does not affect the result of another prediction on that model or on any other model, predictions are stateless operations in the context of cluster computing and are therefore generally easy to parallelize. Thus, given a batch of data and executable instructions, the normal behavior of the framework's partitioning and allocation algorithms may result in linear scaling.
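
The statelessness described above can be sketched with a local thread pool standing in for the cluster framework. This is a minimal illustration under stated assumptions: `predict_one` is a hypothetical stand-in for a fitted model, and the chunking scheme is a simplified substitute for a framework's partitioning algorithm.

```python
from concurrent.futures import ThreadPoolExecutor


def predict_one(row):
    """Hypothetical stand-in for a fitted model; depends only on its input."""
    return 2.0 * row["x"] + 1.0


def predict_chunk(chunk):
    return [predict_one(row) for row in chunk]


def predict_batch(rows, workers=4, chunk_size=100):
    """Split rows into chunks, predict each chunk independently, and merge."""
    chunks = [rows[i:i + chunk_size] for i in range(0, len(rows), chunk_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves chunk order, so results line up with the input rows.
        results = list(pool.map(predict_chunk, chunks))
    return [prediction for chunk in results for prediction in chunk]


rows = [{"x": float(i)} for i in range(250)]
predictions = predict_batch(rows)
```

Because no chunk reads or writes state shared with another chunk, the chunks can be computed in any order or on any machine, and the merged result matches a sequential run.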

In some cases, making predictions may be part of a larger workflow in which data is produced and consumed in many steps. In such cases, prediction jobs may be integrated with other operations through a publish-subscribe mechanism such as Apache Kafka. The prediction service subscribes to channels that produce new observations requiring predictions. After the service makes a prediction, it publishes the prediction to one or more channels from which other programs can consume it.

Performance Enhancements

Fitting modeling techniques and/or searching among a large number of alternative techniques can be computationally intensive, and computing resources can be expensive. Some embodiments of the system 100 for generating predictive models identify opportunities to reduce resource consumption.

Based on user preferences, engine 110 may adjust its search for models to reduce execution time and consumption of computing resources. In some cases, a prediction problem may include a large amount of training data. In such cases, the benefit of cross-validation in terms of reducing model bias is typically smaller. A user may therefore prefer to fit each model once on all of the training data rather than on each cross-validation fold, because the time for one computation on five to ten times the amount of data is typically much less than the time for five to ten computations, each on one-fifth to one-tenth of the data.

Even if the user does not have a relatively large training set, the user may still want to save time and resources. In such cases, engine 110 may offer a "greedier" option that uses several more aggressive search approaches. First, engine 110 may try a smaller subset of the possible modeling techniques (e.g., only those whose expected performance is relatively high). Second, engine 110 may prune underperforming models more aggressively in each round of training and evaluation. Third, engine 110 may take larger steps when searching for the optimal hyper-parameters for each model.
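
The second approach, pruning weaker models more aggressively in each round, can be sketched as follows. This is a minimal illustration under stated assumptions: the function name, the `evaluate` callback, and the fixed `keep_fraction` are hypothetical, and a lower `keep_fraction` corresponds to a greedier search.

```python
import math


def successive_elimination(candidates, evaluate, rounds, keep_fraction=0.5):
    """Keep only the top fraction of candidates after each evaluation round.

    candidates: iterable of candidate identifiers.
    evaluate: callable (candidate, round_index) -> score, higher is better.
    keep_fraction: share of survivors retained per round.
    """
    survivors = list(candidates)
    for round_index in range(rounds):
        ranked = sorted(
            survivors,
            key=lambda c: evaluate(c, round_index),
            reverse=True,
        )
        keep = max(1, math.ceil(len(ranked) * keep_fraction))
        survivors = ranked[:keep]
    return survivors


# Fixed scores stand in for real training and evaluation results.
scores = {"a": 0.9, "b": 0.8, "c": 0.7, "d": 0.6}
default = successive_elimination(scores, lambda c, r: scores[c], rounds=2)
greedy = successive_elimination(scores, lambda c, r: scores[c], rounds=1,
                                keep_fraction=0.25)
```

In a real search, `evaluate` would typically fit each surviving model on a larger share of the data in later rounds, so that computation concentrates on the most promising candidates.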

In general, searching for better (e.g., optimal) hyper-parameters can be costly. Thus, even if the user wants engine 110 to evaluate a broad spectrum of potential models and not to prune them aggressively, the engine may still save resources by limiting (e.g., optimizing) the hyper-parameter search. The cost of this search is generally proportional to the size of the data set. One strategy is to tune the hyper-parameters on a small portion of the data set and then extrapolate those parameters to the entire data set. In some cases, adjustments are made to account for the larger amount of data. In some embodiments, engine 110 may use one of two strategies. First, engine 110 may make the adjustments based on heuristics for the modeling technique. Second, engine 110 may engage in meta-machine learning: it tracks how the hyper-parameters of each modeling technique vary with data set size, builds a meta-predictive model of the hyper-parameters, and then applies the meta-model when the user wants to make this trade-off.

When working with a categorical prediction problem, there may be a minority class and a majority class. The minority class may be much smaller but relatively more important, as in the case of fraud detection. In some embodiments, engine 110 "down-samples" the majority class so that the number of training observations for that class is more similar to the number for the minority class. In some cases, a modeling technique can automatically accommodate the weights that result from such down-sampling directly during model fitting. If the modeling technique does not accommodate such weights, engine 110 may make a post-fit adjustment proportional to the amount of down-sampling. This approach may sacrifice some accuracy in exchange for much shorter execution time and lower resource consumption.
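
Down-sampling followed by a post-fit adjustment can be sketched as follows. This is a minimal illustration under stated assumptions: the source does not give the adjustment formula, so the odds-based correction below (a commonly used prior correction for a binary problem) and all function names are assumptions rather than the described method.

```python
import random


def downsample_majority(rows, label_key, majority_label, keep_fraction, seed=0):
    """Keep every minority row and a random fraction of majority rows."""
    rng = random.Random(seed)
    return [
        row for row in rows
        if row[label_key] != majority_label or rng.random() < keep_fraction
    ]


def correct_probability(p_sampled, keep_fraction):
    """Map a minority-class probability back to the original class balance.

    Keeping a fraction `keep_fraction` of the majority class inflates the
    minority-class odds by 1 / keep_fraction, so the odds are scaled back
    by keep_fraction before converting to a probability.
    """
    return keep_fraction * p_sampled / (
        keep_fraction * p_sampled + (1.0 - p_sampled)
    )


rows = [{"fraud": 0}] * 1000 + [{"fraud": 1}] * 10
sampled = downsample_majority(rows, "fraud", 0, keep_fraction=0.1)
adjusted = correct_probability(0.5, keep_fraction=0.1)
```

A model fitted on the down-sampled rows that reports a 50% minority-class probability corresponds, after the correction, to roughly 9% on the original class balance.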

Some modeling techniques may execute more efficiently than others. For example, some modeling techniques may be optimized to run on parallel computing clusters or on servers with special-purpose processors. The metadata of each modeling technique may indicate any such performance advantages. When the engine 110 allocates computing jobs, it may detect jobs for modeling techniques whose advantages apply in the currently available computing environment. Then, during each round of the search, the engine 110 may use larger chunks of the data set for those jobs, so that those modeling techniques complete sooner. Moreover, if their accuracy is good enough, there may be no need to test other modeling techniques that are performing relatively poorly.

User Interface (UI) Enhancements

The engine 110 may help users produce better predictive models by extracting more information from them before model building, and may give users a better understanding of model performance after model fitting.

In some cases, a user may have additional information about the data set that is suitable for better directing the search for accurate predictive models. For example, a user may know that certain observations have special significance and may wish to indicate that significance. The engine 110 may allow the user to easily create new variables for this purpose. For example, one synthetic variable may indicate that the engine should use particular observations as part of the training, validation, or holdout data partitions rather than assigning them to those partitions randomly. This capability may be useful in situations where certain values occur infrequently and the corresponding observations should be carefully assigned to different partitions. It may also be useful in situations where the user has trained a model using another machine learning system and wishes to perform a comparison in which the training, validation, and holdout partitions are the same.
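
By way of non-limiting illustration, a synthetic partition variable might be honored as sketched below. The column name `partition`, the partition labels, and the default split proportions are assumptions of this sketch and are not taken from this disclosure.

```python
import random

PARTITIONS = ("training", "validation", "holdout")


def assign_partitions(rows, partition_key="partition",
                      weights=(0.64, 0.16, 0.20), seed=0):
    """Use the user's partition where one is given; otherwise assign randomly.

    Returns new row dictionaries and leaves the input rows unmodified.
    """
    rng = random.Random(seed)
    assigned = []
    for row in rows:
        choice = row.get(partition_key)
        if choice not in PARTITIONS:
            # No valid user-specified partition: fall back to a random draw.
            choice = rng.choices(PARTITIONS, weights=weights, k=1)[0]
        assigned.append({**row, partition_key: choice})
    return assigned


# Hypothetical rows: the first observation is pinned to the holdout partition.
rows = [{"id": 1, "partition": "holdout"}, {"id": 2}]
result = assign_partitions(rows)
```

Fixing the seed makes the random part of the split reproducible, which helps when the same partitions must be reused for a comparison with another system.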

Similarly, certain observations may represent particularly important events to which the user wishes to assign additional weight. Thus, an additional variable inserted into the data set may indicate the relative weight of each observation. The engine 110 may then use these weights when training models and calculating their accuracy, with the goal of producing more accurate predictions under the higher-weighted conditions.
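
By way of non-limiting illustration, per-observation weights might enter an accuracy calculation as sketched below. Weighted mean squared error is chosen here only as an example metric, and the name `weighted_mse` is hypothetical.

```python
def weighted_mse(actual, predicted, weights):
    """Mean squared error in which each observation counts by its weight."""
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    squared = (w * (a - p) ** 2 for a, p, w in zip(actual, predicted, weights))
    return sum(squared) / total_weight


# Hypothetical example: the first observation is three times as important.
unweighted = weighted_mse([1.0, 2.0], [1.0, 4.0], [1.0, 1.0])
weighted = weighted_mse([1.0, 2.0], [1.0, 4.0], [3.0, 1.0])
```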

In other cases, the user may have prior information about how certain features should behave in the models. For example, a user may know that a certain feature should have a monotonic effect on the prediction target over a certain range. In auto insurance, it is generally believed that the chance of an accident increases monotonically with age after the age of 30. Another example is creating bands for an otherwise continuous variable. Personal income is continuous, but there are analytic conventions for assigning values to bands, such as $10K increments up to $100K, $25K increments up to $250K, and a single band for all incomes above $250K. There are also cases where limitations on the data set require constraints on particular features. Sometimes, a categorical variable may have a very large number of values relative to the size of the data set. The user may wish to indicate that the engine 110 should ignore categorical features that have more than a certain number of possible categories, or should limit the number of categories to the X most frequent and assign all other values to an "other" category. In all of these situations, the user interface may present the user with the option of specifying this information for each detected feature (e.g., at step 412 of the method 400).
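
By way of non-limiting illustration, limiting a categorical feature to its X most frequent values might be sketched as follows. The function name `limit_categories` and the label `other` used in the code are illustrative choices.

```python
from collections import Counter


def limit_categories(values, max_categories, other_label="other"):
    """Keep the most frequent categories and map the rest to other_label."""
    most_common = Counter(values).most_common(max_categories)
    keep = {category for category, _count in most_common}
    return [v if v in keep else other_label for v in values]


# Hypothetical feature with four categories, limited to the two most frequent.
values = ["a", "a", "a", "b", "b", "c", "d"]
limited = limit_categories(values, max_categories=2)
```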

The user interface may provide guided assistance in transforming features. For example, a user may wish to convert a continuous variable to a categorical variable, but there may be no standard convention for that variable. By analyzing the shape of the distribution, the engine 110 may choose the optimal number of categorical bands and the points at which to place "knots" in the distribution that define the boundaries between the bands. Optionally, the user may override these defaults in the user interface by adding or deleting knots, as well as by moving the locations of the knots.
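
By way of non-limiting illustration, default knots and the resulting bands might be computed as sketched below. Placing knots at equal-frequency quantiles is an assumption of this sketch, since the disclosure does not specify how the knots are chosen. The names `quantile_knots` and `band_index` are hypothetical.

```python
from bisect import bisect_right


def quantile_knots(values, n_bands):
    """Propose knots that split the sorted values into roughly equal groups."""
    ordered = sorted(values)
    knots = [ordered[(i * len(ordered)) // n_bands] for i in range(1, n_bands)]
    return sorted(set(knots))  # duplicate knots collapse into one


def band_index(value, knots):
    """Return the band of a value; a value equal to a knot starts a new band."""
    return bisect_right(knots, value)


# Hypothetical continuous feature taking the values 1 through 100.
knots = quantile_knots(range(1, 101), n_bands=4)
bands = [band_index(v, knots) for v in (10, 26, 100)]

# A user override is just an edit to the knot list, here moving one knot.
custom_knots = [30 if k == 26 else k for k in knots]
```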

Similarly, for features that are already categorical, the engine 110 may simplify the representation by combining one or more categories into a single category. Based on the relative frequency of each observed category and the frequency with which the categories appear relative to the values of other features, the engine 110 may calculate the optimal way to combine categories. Optionally, the user may override these calculations by removing original categories from a combined category and/or putting existing categories into a combined category.

In certain cases, a prediction problem may include events that occur at irregular intervals. In such cases, it may be useful to automatically create new features that capture how many of these events have occurred within a particular time frame. For example, in insurance prediction problems, a data set may contain a record of each time a policyholder files a claim. When building a model to predict future risk, however, it may be more useful to consider how many claims the policyholder has had in the past X years. The engine may detect such situations when evaluating the data set (e.g., at step 408 of the method 400) by detecting a data structure relationship between records corresponding to entities and other records corresponding to events. When presenting the data set to the user (e.g., at step 410), the user interface may automatically create or suggest such features. It may also suggest time frame thresholds based on the frequency with which the events occur, calculated to maximize the statistical dependence between such a variable and the occurrence of future events, or chosen using some other heuristic. The user interface may also allow the user to override the creation of such features, to force the creation of such features, and to override the suggested time frame thresholds.
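
By way of non-limiting illustration, a feature counting the events in a trailing time frame might be derived as sketched below. The name `count_recent_events`, the use of days as the window unit, and the example dates are assumptions of this sketch.

```python
from datetime import date, timedelta


def count_recent_events(event_dates, as_of, window_days):
    """Count events in the window_days ending at as_of (start exclusive)."""
    start = as_of - timedelta(days=window_days)
    return sum(1 for d in event_dates if start < d <= as_of)


# Hypothetical claim history for one policyholder.
claims = [date(2014, 1, 1), date(2016, 6, 1), date(2017, 3, 1)]
as_of = date(2017, 10, 21)

claims_last_3_years = count_recent_events(claims, as_of, window_days=3 * 365)
claims_last_5_years = count_recent_events(claims, as_of, window_days=5 * 365)
```

Changing `window_days` corresponds to overriding the suggested time frame threshold.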

When the system makes predictions based on a model, the user may wish to review those predictions and explore unusual ones. For example, the user interface may provide a list of all or a subset of the predictions for a model and indicate which ones were extreme, either in terms of the magnitude of the predicted value or the low probability of its having that value. It is also possible to provide insight into the reasons for an extreme value. For example, a particularly high value in an auto insurance risk model may be attributed to the reasons "age < 25" and "marital status = single."

Further Description of Some Embodiments

Although examples provided herein may describe modules as residing on separate computers or operations as being performed by separate computers, it should be appreciated that the functionality of these components can be implemented on a single computer, or on any larger number of computers in a distributed fashion.

The above-described embodiments may be implemented in any of numerous ways. For example, the embodiments may be implemented using hardware, software, or a combination thereof. When implemented in software, the software code can be executed on any suitable processor or collection of processors, whether provided in a single computer or distributed among multiple computers. Further, it should be appreciated that a computer may be embodied in any of a number of forms, such as a rack-mounted computer, a desktop computer, a laptop computer, or a tablet computer. Additionally, a computer may be embedded in a device not generally regarded as a computer but with suitable processing capabilities, including a Personal Digital Assistant (PDA), a smart phone, or any other suitable portable or fixed electronic device.

Such computers may be interconnected by one or more networks in any suitable form, including a local area network or a wide area network, such as an enterprise network or the Internet. Such networks may be based on any suitable technology, may operate according to any suitable protocol, and may include wireless networks, wired networks, or fiber optic networks.

Also, the various methods or processes outlined herein may be coded as software that is executable on one or more processors that employ any one of a variety of operating systems or platforms. Additionally, such software may be written using any of a number of suitable programming languages and/or programming or scripting tools, and also may be compiled as executable machine language code or intermediate code that is executed on a framework or virtual machine.

In this respect, some embodiments may be embodied as a computer readable medium (or multiple computer readable media) (e.g., a computer memory, one or more floppy disks, compact disks, optical disks, magnetic tapes, flash memories, circuit configurations in field programmable gate arrays or other semiconductor devices, or other tangible computer storage media) encoded with one or more programs that, when executed on one or more computers or other processors, perform methods that implement the various embodiments discussed above. The computer readable medium or media may be non-transitory. The computer readable medium or media may be transportable, such that the program or programs stored thereon can be loaded onto one or more different computers or other processors to implement various aspects of predictive modeling as discussed above. The terms "program" and "software" are used herein in a generic sense to refer to any type of computer code or set of computer-executable instructions that can be employed to program a computer or other processor to implement the various aspects described herein. Additionally, it should be appreciated that, according to one aspect of this disclosure, one or more computer programs that, when executed, perform predictive modeling methods need not reside on a single computer or processor, but may be distributed in a modular fashion among a number of different computers or processors to implement various aspects of predictive modeling.

Computer-executable instructions may be in many forms, such as program modules, executed by one or more computers or other devices. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. Typically, the functionality of the program modules may be combined or distributed as desired in various embodiments.

Also, data structures may be stored in computer-readable media in any suitable form. For simplicity of illustration, data structures may be shown to have fields that are related through location in the data structure. Such relationships may likewise be achieved by assigning storage for the fields to locations in a computer-readable medium that convey the relationship between the fields. However, any suitable mechanism may be used to establish a relationship between information in fields of a data structure, including through the use of pointers, tags, or other mechanisms that establish relationships between data elements.

Also, predictive modeling techniques may be embodied as a method, of which an example has been provided. The acts performed as part of the method may be ordered in any suitable way. Accordingly, embodiments may be constructed in which acts are performed in an order different from that illustrated, which may include performing some acts simultaneously, even though they are shown as sequential acts in the illustrative embodiments.

In some embodiments, the method(s) may be implemented as computer instructions stored in portions of a computer's random access memory to provide control logic that effects the processes described above. In such an embodiment, the program may be written in any one of a number of high-level languages, such as FORTRAN, PASCAL, C, C++, C#, Java, JavaScript, Tcl, or BASIC. Further, the program may be written in a script, macro, or functionality embedded in commercially available software, such as EXCEL or VISUAL BASIC. Additionally, the software may be implemented in an assembly language directed to a microprocessor resident on a computer. For example, the software may be implemented in Intel 80x86 assembly language if it is configured to run on an IBM PC or PC clone. The software may be embedded on an article of manufacture including, but not limited to, "computer-readable program means" such as a floppy disk, a hard disk, an optical disk, a magnetic tape, a PROM, an EPROM, or a CD-ROM.

Various aspects of the present invention may be used alone, in combination, or in a variety of arrangements not specifically described in the foregoing, and the invention is therefore not limited in its application to the details and arrangement of components set forth in the foregoing description or illustrated in the drawings. For example, aspects described in one embodiment may be combined in any manner with aspects described in other embodiments.

Terminology

The phraseology and terminology used herein are for the purpose of description and should not be regarded as limiting.

The indefinite articles "a" and "an," as used in the specification and in the claims, should be understood to mean "at least one" unless clearly indicated to the contrary. The phrase "and/or," as used in the specification and in the claims, should be understood to mean "either or both" of the elements so conjoined, i.e., elements that are conjunctively present in some cases and disjunctively present in other cases. Multiple elements listed with "and/or" should be construed in the same fashion, i.e., as "one or more" of the elements so conjoined. Other elements may optionally be present other than the elements specifically identified by the "and/or" clause, whether related or unrelated to those elements specifically identified. Thus, as a non-limiting example, a reference to "A and/or B," when used in conjunction with open-ended language such as "comprising," can refer, in one embodiment, to A only (optionally including elements other than B); in another embodiment, to B only (optionally including elements other than A); in yet another embodiment, to both A and B (optionally including other elements); etc.

As used in the specification and in the claims, "or" should be understood to have the same meaning as "and/or" as defined above. For example, when separating items in a list, "or" or "and/or" shall be interpreted as being inclusive, i.e., as the inclusion of at least one, but also including more than one, of a number or list of elements, and, optionally, additional unlisted items. Only terms clearly indicated to the contrary, such as "only one of" or "exactly one of," or, when used in the claims, "consisting of," will refer to the inclusion of exactly one element of a number or list of elements. In general, the term "or" as used shall only be interpreted as indicating exclusive alternatives (i.e., "one or the other but not both") when preceded by terms of exclusivity, such as "either," "one of," "only one of," or "exactly one of." "Consisting essentially of," when used in the claims, shall have its ordinary meaning as used in the field of patent law.

As used in the specification and in the claims, the phrase "at least one," in reference to a list of one or more elements, should be understood to mean at least one element selected from any one or more of the elements in the list of elements, but not necessarily including at least one of each and every element specifically listed within the list of elements and not excluding any combinations of elements in the list of elements. This definition also allows that elements may optionally be present other than the elements specifically identified within the list of elements to which the phrase "at least one" refers, whether related or unrelated to those elements specifically identified. Thus, as a non-limiting example, "at least one of A and B" (or, equivalently, "at least one of A or B," or, equivalently, "at least one of A and/or B") can refer, in one embodiment, to at least one, optionally including more than one, A, with no B present (and optionally including elements other than B); in another embodiment, to at least one, optionally including more than one, B, with no A present (and optionally including elements other than A); in yet another embodiment, to at least one, optionally including more than one, A, and at least one, optionally including more than one, B (and optionally including other elements); etc.

The use of "including," "comprising," "having," "containing," "involving," and variations thereof is meant to encompass the items listed thereafter and additional items.

The use of ordinal terms such as "first," "second," "third," etc., in the claims to modify a claim element does not by itself connote any priority, precedence, or order of one claim element over another, or the temporal order in which acts of a method are performed. Ordinal terms are used merely as labels to distinguish one claim element having a certain name from another element having the same name (but for use of the ordinal term), so as to distinguish the claim elements.

Equivalents

Having thus described several aspects of at least one embodiment of this invention, it is to be appreciated that various alterations, modifications, and improvements will readily occur to those skilled in the art. Such alterations, modifications, and improvements are intended to be part of the invention and are intended to be within the spirit and scope of the invention. Accordingly, the foregoing description and drawings are by way of example only.

Claims

A predictive modeling method comprising:
performing a predictive modeling procedure, the predictive modeling procedure comprising:
(a) obtaining time series data comprising one or more data sets, each data set comprising a plurality of observations, each observation comprising (1) an indication of a time associated with the observation and (2) respective values of one or more variables;
(b) determining a time interval of the time series data;
(c) identifying one or more of the variables as targets and zero or more other variables as features;
(d) determining an expected range and a skip range associated with a prediction problem represented by the time series data, wherein the expected range represents a duration of a period for which values of the targets are to be predicted, and wherein the skip range represents a time lag between the time associated with the earliest prediction in the expected range and the time associated with the latest observation on which the predictions in the expected range are based;
(e) generating training data from the time series data, the training data comprising a first subset of the observations in at least one of the data sets, the first subset of the observations comprising training-input and training-output collections of the observations, wherein the skip range separates an end of the training-input time range from a beginning of the training-output time range, and a duration of the training-output time range is at least as long as the expected range;
(f) generating testing data from the time series data, the testing data comprising a second subset of the observations in at least one of the data sets, the second subset of the observations comprising test-input and test-validation collections of the observations, wherein the skip range separates an end of the test-input time range from a beginning of the test-validation time range, and a duration of the test-validation time range is at least as long as the expected range;
(g) fitting a predictive model to the training data; and
(h) testing the fitted model against the testing data.
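
By way of non-limiting illustration, the separation of an input time range from an output time range by the skip range, as recited in steps (e) and (f), might be sketched as follows. The sketch assumes that times are integer multiples of the time interval and that a skip range of 1 places the earliest prediction one interval after the latest input observation. The name `make_window` and its parameters are hypothetical.

```python
def make_window(observations, input_start, input_len, skip, horizon):
    """Split (time_step, value) observations into input and output collections.

    skip is the lag, in time intervals, between the latest input observation
    and the earliest output observation; horizon is the length of the output
    range (the expected range) in time intervals.
    """
    if skip < 1 or horizon < 1:
        raise ValueError("skip and horizon must be at least one interval")
    input_end = input_start + input_len - 1   # time of the latest input
    output_start = input_end + skip           # time of the earliest prediction
    output_end = output_start + horizon - 1
    inputs = [o for o in observations if input_start <= o[0] <= input_end]
    outputs = [o for o in observations if output_start <= o[0] <= output_end]
    return inputs, outputs


# Hypothetical series of 20 observations at time steps 0..19.
series = [(t, float(t)) for t in range(20)]

train_in, train_out = make_window(series, input_start=0, input_len=5, skip=2, horizon=3)
test_in, test_val = make_window(series, input_start=10, input_len=5, skip=2, horizon=3)
```

A model would then be fitted on `train_in` and `train_out` and scored by comparing its predictions from `test_in` against `test_val`.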

The method of claim 1,
wherein the time interval of the time series data is determined based at least in part on the times associated with at least a subset of the observations included in at least one of the data sets.

The method of claim 2,
wherein determining the time interval of the time series data comprises:
determining, for each of the data sets, a respective time interval of the data set;
determining that the time intervals of the data sets are constant; and
setting the time interval of the time series data to the time interval of the data sets.

The method of claim 3,
wherein determining the time interval of a data set comprises:
determining, for one or more pairs of consecutive observations included in the data set, a respective time period between the consecutive observations;
determining that the time period between the pairs of consecutive observations is constant; and
setting the time interval of the data set to the time period between the pairs of consecutive observations.
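
By way of non-limiting illustration, determining the time interval of a data set from the gaps between consecutive observations might be sketched as follows. The name `detect_time_interval` and the choice to return `None` for irregular data are assumptions of this sketch.

```python
from datetime import datetime, timedelta


def detect_time_interval(timestamps):
    """Return the gap between consecutive observations if it is constant.

    Returns None when the gaps differ or when fewer than two timestamps
    are supplied.
    """
    ordered = sorted(timestamps)
    gaps = {later - earlier for earlier, later in zip(ordered, ordered[1:])}
    if len(gaps) == 1:
        return gaps.pop()
    return None


# Hypothetical daily and irregular series.
daily = [datetime(2017, 1, d) for d in (1, 2, 3, 4)]
irregular = [datetime(2017, 1, d) for d in (1, 2, 4)]

daily_interval = detect_time_interval(daily)
irregular_interval = detect_time_interval(irregular)
```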

The method of claim 2,
wherein determining the time interval of the time series data comprises:
determining, for each of the data sets, a respective time interval of the data set; and
determining that the time intervals of at least two of the data sets are different,
wherein the time interval of the time series data is determined based at least in part on at least one of (1) respective portions of the observations included in each of the data sets and (2) the respective time intervals of the data sets.

The method of claim 5,
wherein determining the respective time interval of a data set comprises:
determining a respective time period between each pair of consecutive observations included in the data set,
wherein, if the time periods between the pairs of consecutive observations exhibit a plurality of non-uniform durations, the time interval of the data set is determined based at least in part on at least one of (1) respective portions of the pairs of consecutive observations exhibiting each of the non-uniform durations and (2) the durations of the time periods; and
wherein, if the time periods between the pairs of consecutive observations have a constant duration, the time interval of the data set is that duration.

The method of claim 5,
wherein the time interval of the time series data is the shortest time interval that is an integer multiple of each of the respective time intervals of the data sets.
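
By way of non-limiting illustration, the shortest time interval that is an integer multiple of each data set's time interval is the least common multiple of those intervals. A minimal sketch, assuming the intervals are expressed as positive whole numbers of seconds and using the hypothetical name `common_time_interval`, follows.

```python
from functools import reduce
from math import gcd


def common_time_interval(intervals_in_seconds):
    """Least common multiple of a non-empty list of positive intervals."""
    return reduce(lambda a, b: a * b // gcd(a, b), intervals_in_seconds)


# Hypothetical data sets: hourly and daily, then 2-minute and 3-minute.
hourly_and_daily = common_time_interval([3600, 86400])
two_and_three_minutes = common_time_interval([120, 180])
```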

The method of claim 7, further comprising:
for each data set, if the time interval of the data set is shorter than the time interval of the time series data, converting the time interval of the data set to the time interval of the time series data by down-sampling the observations in the data set.

The method of claim 8,
wherein down-sampling the observations in the data set comprises, for each instance of the time interval of the time series data in a time period corresponding to the data set:
identifying all observations in the data set associated with times corresponding to the respective instance of the time interval of the time series data;
aggregating the identified observations to produce an aggregate observation; and
replacing the identified observations in the data set with the aggregate observation.

The method of claim 9,
wherein the number of identified observations corresponding to an instance of the time interval of the time series data is equal to the ratio between the time interval of the time series data and the time interval of the data set.

The method of claim 9,
wherein aggregating the identified observations comprises setting the value of each variable of the aggregate observation to (1) the value of the corresponding variable included in the earliest of the identified observations, (2) the value of the corresponding variable included in the most recent of the identified observations, (3) the maximum of the corresponding variable values included in the identified observations, (4) the minimum of the corresponding variable values included in the identified observations, (5) the mean of the corresponding variable values included in the identified observations, or (6) the value of a function of the corresponding variable values included in the identified observations.
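
By way of non-limiting illustration, down-sampling by aggregation might be sketched as follows for observations that carry a single variable. The sketch assumes integer times and groups observations by integer division of the time by the target interval. The names `down_sample` and `AGGREGATORS` are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Aggregation options corresponding to earliest, most recent, max, min, mean.
AGGREGATORS = {
    "first": lambda values: values[0],
    "last": lambda values: values[-1],
    "max": max,
    "min": min,
    "mean": mean,
}


def down_sample(observations, target_interval, how="mean"):
    """Replace the observations in each target interval with one aggregate.

    observations is a list of (time, value) pairs with integer times.
    """
    buckets = defaultdict(list)
    for time, value in sorted(observations, key=lambda o: o[0]):
        buckets[time // target_interval].append(value)
    aggregate = AGGREGATORS[how]
    return [(k * target_interval, aggregate(v)) for k, v in sorted(buckets.items())]


# Hypothetical hourly series converted to a 3-hour interval.
hourly = [(h, float(h)) for h in range(6)]
by_mean = down_sample(hourly, target_interval=3, how="mean")
by_last = down_sample(hourly, target_interval=3, how="last")
```

Here each 3-hour instance aggregates three hourly observations, which matches the ratio between the two time intervals.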

12. The method of claim 5,
wherein the time interval of the time series data is selected from the group consisting of time intervals of the data sets.

13. The method of claim 12,
wherein the data sets include a first data set having a first time interval and a second data set having a second time interval greater than the first time interval, wherein the second time interval is selected as the time interval of the time series data,
the method further comprising:
converting the time interval of the first data set to the time interval of the time series data by down-sampling the observations in the first data set.

14. The method of claim 5,
wherein the time interval of the time series data is different from each of the time intervals of the data sets.

15. The method of claim 1,
wherein at least one group of observations of the time series data includes respective values of a first variable, the method further comprising,
before fitting the predictive model to the training data and testing the customized model against the testing data:
determining that the values of the first variable include time values;
for each observation in the group, generating a respective value of a second variable, wherein the value of the second variable comprises an offset between the time value of the first variable and a reference time value; and
adding the values of the second variable to the respective observations in the group.

16. The method of claim 15,
further comprising removing the values of the first variable from the observations in the group.

17. The method of claim 15,
wherein the reference time value includes a date of an event.

18. The method of claim 17,
wherein the event comprises birth, marriage, graduation from school, commencement of employment with an employer, or commencement of work in a particular position.
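
By way of non-limiting illustration, the generation of an offset variable relative to a reference event date can be sketched as follows. The sketch assumes observations are dictionaries and that the offset is measured in whole days; the function `add_time_offset` and the key names are hypothetical.

```python
from datetime import date


def add_time_offset(observations, time_key, reference,
                    offset_key="days_since_reference", drop_original=False):
    """Add an offset (in days) between each time value and a reference date.

    Assumption: observations[i][time_key] is a datetime.date.
    If drop_original is True, the original time variable is removed.
    """
    result = []
    for obs in observations:
        new_obs = dict(obs)  # leave the input observation unmodified
        new_obs[offset_key] = (obs[time_key] - reference).days
        if drop_original:
            del new_obs[time_key]
        result.append(new_obs)
    return result


# Reference event: commencement of employment on 2020-01-01 (example value).
hire_date = date(2020, 1, 1)
observations = [{"observed_on": date(2020, 3, 1), "sales": 10}]
with_offset = add_time_offset(observations, "observed_on", hire_date)
offset_only = add_time_offset(observations, "observed_on", hire_date,
                              drop_original=True)
```

The `drop_original` option corresponds to removing the values of the first variable once the offset has been added.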

19. The method of claim 1,
wherein the variables include a first variable and a second variable, the method further comprising:
determining that changes in values of the first variable and the second variable are correlated, with a time lag between changes in the value of the first variable and the correlated changes in the value of the second variable; and
displaying, via a graphical user interface, graphical content representing a duration of the time lag between the changes in the value of the first variable and the correlated changes in the value of the second variable.
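
By way of non-limiting illustration, one way to estimate such a time lag is to compute the correlation between the two variables at several candidate lags and keep the lag with the strongest correlation. The claim does not prescribe this technique; the Pearson correlation, the function names, and the example series below are assumptions, and the graphical display step is not shown.

```python
def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if undefined)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5


def best_lag(first, second, max_lag):
    """Return (lag, correlation) maximizing |correlation| of first[t] with second[t + lag]."""
    scores = {
        lag: pearson(first[:len(first) - lag], second[lag:])
        for lag in range(max_lag + 1)
    }
    lag = max(scores, key=lambda k: abs(scores[k]))
    return lag, scores[lag]


# The second series repeats the first series two time steps later.
first = [0, 1, 0, 2, 0, 3, 0, 1, 0, 2]
second = [0, 0, 0, 1, 0, 2, 0, 3, 0, 1]
lag, score = best_lag(first, second, max_lag=4)
```

The returned lag is expressed in units of the time interval of the time series data, so multiplying it by that interval gives the duration that would be displayed.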

20. The method of claim 1,
wherein the expected range is determined based at least in part on at least one of (1) a time interval of the time series data, (2) the number of observations included in the time series data, (3) a period of time corresponding to the time series data, and (4) a natural time period selected from the group consisting of microseconds, milliseconds, seconds, minutes, hours, days, weeks, months, quarters, seasons, years, decades, centuries, and millennia.

21. The method of claim 20,
wherein the expected range is an integer multiple of the time interval of the time series data.

22. The method of claim 21,
wherein an interval between times associated with successive predictions in the expected range is equal to the time interval of the time series data.

23. The method of claim 1,
wherein the skip range includes a latency in collection of the time series data, a latency in communication of the time series data, a latency in analysis of the time series data, a latency in communication of the analysis of the time series data, and a latency in implementation of actions based on the analysis of the time series data.

24. The method of claim 1,
wherein a duration of the training-input time range is determined based, at least in part, on at least one of a total number of observations included in the time series data, an amount of variation in the values of at least one of the variables over time, an amount of periodic variation in the values of at least one of the variables, a consistency of variation in the values of at least one of the variables over a plurality of time periods, and a duration of the expected range.

25. The method of claim 24,
wherein fitting the predictive model to the training data comprises fitting the predictive model to a subset of the training data corresponding to a portion of the training-input time range,
and wherein the portion of the training-input time range begins at a time subsequent to a start time of the training-input time range and ends at an end time of the training-input time range.

26. The method of claim 25,
wherein the duration of the portion of the training-input time range is an integer multiple of the duration of the expected range.

27. The method of claim 1,
further comprising down-sampling the training data prior to fitting the predictive model to the training data.

28. The method of claim 27,
Down-sampling the training data comprises:
removing, from the training data, all observations obtained from at least one of the data sets.

29. The method of claim 27,
Down-sampling the training data comprises:
setting the down-sampled time interval of the training data to an integer multiple of the time interval of the time series data; and
For each instance of the down-sampled time interval of the training data:
identifying all observations in the training data associated with times corresponding to the respective instances of the down-sampled time interval of the training data;
aggregating the identified observations to produce an aggregate observation; and
replacing the identified observations in the training data with the aggregate observation.

30. The method of claim 1,
further comprising down-sampling the testing data prior to testing the customized model against the testing data.

31. The method of claim 1,
further comprising performing cross-validation of the predictive model.

32. The method of claim 31,
wherein the training data is first training data, the testing data is first testing data, the customized model is a first customized model, and performing cross-validation of the predictive model comprises:
(i) generating second training data and second testing data from the time series data, wherein the second training data comprises a third subset of observations in at least one of the data sets, and the second testing data comprises a fourth subset of observations in at least one of the data sets;
(j) fitting the predictive model to the second training data to obtain a second customized model; and
(k) testing the second customized model against the second testing data.

33. The method of claim 32,
wherein the first subset of observations corresponds to a sliding training window covering a first range of training times, each observation included in the first subset is associated with a time within the first range of training times, the third subset of observations corresponds to a sliding training window covering a second range of training times, each observation included in the third subset is associated with a time within the second range of training times, and the earliest time of the first range of training times is earlier than the earliest time of the second range of training times.

34. The method of claim 33,
wherein the second subset of observations corresponds to a sliding test window covering a first range of testing times, each observation included in the second subset is associated with a time within the first range of testing times, the fourth subset of observations corresponds to a sliding test window covering a second range of testing times, each observation included in the fourth subset is associated with a time within the second range of testing times, and the earliest time of the first range of testing times is earlier than the earliest time of the second range of testing times.

35. The method of claim 34,
wherein the first range of testing times partially overlaps the second range of training times.

36. The method of claim 35,
wherein the second range of testing times does not overlap any portion of the first range of training times and does not overlap any portion of the second range of training times.
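
By way of non-limiting illustration, sliding training and test windows with the overlap properties recited above can be generated as follows. The sketch assumes observation times are represented by consecutive integer indices; the function `sliding_windows` and its parameters are hypothetical.

```python
def sliding_windows(n_times, train_len, test_len, step, gap=0):
    """Return a list of (train_range, test_range) index ranges.

    Each test window starts `gap` steps after its training window ends,
    and successive windows are shifted forward by `step`.
    """
    folds = []
    start = 0
    while start + train_len + gap + test_len <= n_times:
        train = range(start, start + train_len)
        test_start = start + train_len + gap
        test = range(test_start, test_start + test_len)
        folds.append((train, test))
        start += step
    return folds


folds = sliding_windows(n_times=12, train_len=6, test_len=3, step=2)
(first_train, first_test), (second_train, second_test) = folds
```

In this example the first test window (times 6 to 8) partially overlaps the second training window (times 2 to 7), while the second test window (times 8 to 10) overlaps neither training window.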

37. The method of claim 32,
further comprising partitioning the time series data into a plurality of partitions including at least a first partition and a second partition.

38. The method of claim 37,
wherein partitioning the time series data into the plurality of partitions comprises assigning each of the data sets to a corresponding partition.

39. The method of claim 37,
wherein partitioning the time series data into the plurality of partitions comprises temporally partitioning the time series data.

40. The method of claim 39,
wherein each of the partitions corresponds to a respective portion of a period of time associated with the time series data, and each observation included in the time series data is assigned to the partition corresponding to the portion of the period of time that matches the time associated with the observation.

41. The method of claim 37, wherein:
the first training data comprises a subset of observations included in the first partition of the time series data;
the first testing data includes respective subsets of observations included in all partitions of the time series data except for the first partition;
the second training data comprises a subset of observations included in the second partition of the time series data; and
the second testing data includes respective subsets of observations included in all partitions of the time series data except for the second partition.

42. The method of claim 32,
wherein a first division of the time series data includes the first training data, the second training data, the first testing data, and the second testing data, and a second division of the time series data includes holdout data, the method further comprising:
testing the first customized model and the second customized model against the holdout data.

43. The method of claim 42,
wherein the predictive model is not fitted to the holdout data.

44. The method of claim 1,
further comprising performing nested cross-validation of the predictive model.

45. The method of claim 44,
wherein performing nested cross-validation of the predictive model comprises:
partitioning the time series data into a first plurality of partitions including at least a first partition of the time series data and a second partition of the time series data; and
partitioning the first partition of the time series data into a plurality of partitions of the first partition, including at least a first partition of the first partition of the time series data and a second partition of the first partition of the time series data,
wherein the training data comprises the first partition of the first partition of the time series data, and the testing data comprises at least a plurality of partitions of the first partition of the time series data other than the first partition of the first partition of the time series data.

46. The method of claim 45,
wherein the training data is first training data, the testing data is first testing data, the customized model is a first customized model, and performing nested cross-validation of the predictive model further comprises:
(i) generating second training data and second testing data from the first partition of the time series data, wherein the second training data comprises the second partition of the first partition of the time series data, and the second testing data comprises at least a plurality of partitions of the first partition of the time series data other than the second partition of the first partition of the time series data;
(j) fitting the predictive model to the second training data to obtain a second customized model; and
(k) testing the second customized model against the second testing data.

47. The method of claim 46,
wherein performing the nested cross-validation further comprises:
testing the first customized model and the second customized model on the second partition of the time series data; and
comparing the first customized model and the second customized model based on results of testing the first customized model and the second customized model on the second partition of the time series data.

48. The method of claim 1,
further comprising determining, for the customized model, model-specific predictive values of one or more features of the time series data.

49. The method of claim 48,
further comprising performing at least one action selected from the group consisting of: pruning a feature from the time series data based at least in part on the model-specific predictive values of the features; generating a feature derived from two or more features in the time series data and adding the derived feature to the time series data; blending the predictive model with other predictive models; and allocating resources during a process of evaluating the suitability of predictive modeling procedures for the prediction problem.

50. The method of claim 1, further comprising:
determining, based at least in part on at least one of characteristics of the prediction problem and attributes of respective predictive modeling procedures, the suitability of a plurality of predictive modeling procedures for the prediction problem;
selecting one or more predictive modeling procedures from the plurality of predictive modeling procedures based on the determined suitability of the selected modeling procedures for the prediction problem; and
performing the one or more selected predictive modeling procedures.

51. The method of claim 50,
wherein performing the one or more predictive modeling procedures comprises:
sending an instruction to a plurality of processing nodes, the instruction comprising a resource allocation schedule allocating resources of the processing nodes for execution of the selected modeling procedures, wherein the resource allocation schedule is based at least in part on the determined suitability of the selected modeling procedures for the prediction problem;
receiving results of execution of the selected modeling procedures by the plurality of processing nodes according to the resource allocation schedule, the results comprising at least one of predictive models generated by the selected modeling procedures and scores of the generated models for the time series data associated with the prediction problem; and
selecting, from the generated models, a predictive model for the prediction problem based at least in part on a score of the selected predictive model.

52. The method of claim 1,
further comprising blending the customized model with other customized models to create a blended predictive model.

53. The method of claim 1,
further comprising deploying the customized model.

54. The method of claim 53,
wherein the time series data is first time series data, and deploying the customized model comprises generating one or more predictions by applying the customized model to second time series data representing one or more instances of the prediction problem, wherein the first time series data does not include the second time series data.

55. The method of claim 53,
wherein the time series data is first time series data, and wherein deploying the customized model comprises refreshing the customized model based at least in part on second time series data.

56. The method of claim 55,
wherein the customized model is a first customized model, and refreshing the customized model based at least in part on the second time series data comprises:
performing the predictive modeling procedure on the second time series data to generate a second customized model; and
mixing the first customized model and the second customized model to generate a refreshed predictive model.

57. The method of claim 55,
wherein refreshing the customized model based at least in part on the second time series data comprises:
performing the predictive modeling procedure on third time series data including at least a portion of the first time series data and at least a portion of the second time series data to generate a refreshed predictive model.

58. The method of claim 53,
wherein the customized model is deployed on one or more servers, other customized models are also deployed on the one or more servers, and prediction requests for the customized model and the other customized models are assigned among the servers based at least in part on at least one of: (1) an estimate of the amount of time used by each of the customized models to generate a prediction.

59. The method of claim 58,
wherein each prediction request is assigned to a respective thread, each prediction request has an associated latency-sensitivity value, and the number of threads executing on a particular server is determined based at least in part on the latency-sensitivity values of the threads executing on the particular server.

60. The method of claim 1, further comprising:
determining a value of a metric representing an interaction strength of two or more features included in the time series data; and
when the value of the metric exceeds a threshold, generating time-series values of a new feature based on the values of the two or more features, and adding the new feature to the time series data.

61. The method of claim 1,
further comprising determining a temporal resolution of the time series data.

62. The method of claim 1,
wherein the targets are identified based on user input.

63. A predictive modeling apparatus comprising:
a memory configured to store a machine-executable module encoding a predictive modeling procedure, the predictive modeling procedure comprising a plurality of operations including at least one pre-processing operation and at least one model-fitting operation; and
at least one processor configured to execute the machine-executable module,
wherein execution of the machine-executable module causes the apparatus to perform the predictive modeling procedure, including performing the pre-processing operation and performing the model-fitting operation,
wherein performing the pre-processing operation comprises:
(a) obtaining time series data comprising one or more data sets, each data set comprising a plurality of observations, each observation comprising (1) an indication of a time associated with the observation and (2) respective values of one or more variables;
(b) determining a time interval of the time series data;
(c) identifying the one or more variables as targets and zero or more other variables as features; and
(d) determining an expected range and a skip range associated with a prediction problem represented by the time series data, wherein the expected range indicates a duration of a period for which values of the targets are to be predicted, and wherein the skip range indicates a time lag between a time associated with the earliest prediction in the expected range and a time associated with the most recent observation upon which the predictions in the expected range are based,
wherein performing the model-fitting operation comprises:
(e) generating training data from the time series data, the training data comprising a first subset of observations in at least one of the data sets, the first subset of observations comprising training-input and training-output collections of the observations, wherein the skip range separates the end of the training-input time range from the beginning of the training-output time range, and wherein a duration of the training-output time range is at least as long as the expected range;
(f) generating testing data from the time series data, the testing data comprising a second subset of observations in at least one of the data sets, the second subset of observations comprising test-input and test-validation collections of the observations, wherein the skip range separates the end of the test-input time range from the beginning of the test-validation time range, and wherein a duration of the test-validation time range is at least as long as the expected range;
(g) fitting a predictive model to the training data; and
(h) testing the customized model against the testing data.
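
By way of non-limiting illustration, the separation of input and output time ranges by the skip range in operations (e) and (f) can be sketched as follows. The sketch assumes times are held in a list at a uniform interval and that the skip range and expected range are expressed as whole numbers of intervals; the function `make_window` and its parameters are hypothetical.

```python
def make_window(times, start, input_len, skip, horizon):
    """Return (input_times, output_times) for one training or testing example.

    skip:    lag, in intervals, between the most recent input observation
             and the earliest predicted time (assumed to be >= 1).
    horizon: length of the expected range, in intervals.
    """
    if skip < 1:
        raise ValueError("skip must be at least one interval")
    last_input = start + input_len - 1
    output_start = last_input + skip
    output_end = output_start + horizon
    if output_end > len(times):
        raise ValueError("not enough observations for this window")
    return times[start:last_input + 1], times[output_start:output_end]


times = list(range(20))
train_input, train_output = make_window(times, start=0, input_len=5, skip=3, horizon=3)
test_input, test_validation = make_window(times, start=10, input_len=5, skip=3, horizon=3)
```

Using the same skip and horizon values for the training and testing windows keeps the test conditions aligned with the conditions under which the deployed model would generate predictions.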

64. The apparatus of claim 63,
wherein the machine-executable module comprises a directed graph representing dependencies between the operations.
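
By way of non-limiting illustration, a directed graph of dependencies between operations can be represented as a mapping from each operation to the operations it depends on, from which a valid execution order follows by topological sorting. The operation names below are hypothetical labels for operations (a) to (h); the sketch uses `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later).

```python
from graphlib import TopologicalSorter

# Hypothetical operation names; each key maps to the operations it depends on.
DEPENDENCIES = {
    "obtain_time_series": set(),
    "determine_time_interval": {"obtain_time_series"},
    "identify_targets_and_features": {"obtain_time_series"},
    "determine_ranges": {"determine_time_interval", "identify_targets_and_features"},
    "generate_training_data": {"determine_ranges"},
    "generate_testing_data": {"determine_ranges"},
    "fit_model": {"generate_training_data"},
    "test_model": {"fit_model", "generate_testing_data"},
}


def execution_order(dependencies):
    """Return one order in which every operation follows its dependencies."""
    return list(TopologicalSorter(dependencies).static_order())


order = execution_order(DEPENDENCIES)
```

Operations with no dependency path between them, such as generating the training data and generating the testing data, may appear in either order and could be executed in parallel.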

delete