RU2680760C1

RU2680760C1 - Computerized method for developing and managing scoring models

Info

Publication number: RU2680760C1
Application number: RU2017146235A
Authority: RU
Inventors: Олег Игоревич Травкин; Дмитрий Алексеевич Берестнев; Дмитрий Владимирович Юдочев; Екатерина Сергеевна Жуковская
Original assignee: Публичное Акционерное Общество "Сбербанк России" (Пао Сбербанк)
Priority date: 2018-04-04
Filing date: 2018-04-04
Publication date: 2019-02-26
Also published as: EA038056B1; EA201700609A1; WO2019194696A1

Abstract

FIELD: computer technology.

SUBSTANCE: the invention relates to the field of computer technology. Disclosed is a computerized method for developing and managing scoring models, comprising the following steps: obtaining data for a specified period of time containing factors that affect the scoring model; partitioning the obtained data into samples for development, validation and testing of the scoring model; transforming the factors by establishing relationships between groups of values of the transformed factor and default rates; excluding from the samples at least one transformed factor that correlates with at least one other factor; forming the credit scoring model by training a binary multiple logistic regression; and automatically selecting cut-off zones for at least one scoring model so that it can be installed in the credit procedure.

EFFECT: improved quality of the developed credit scoring models.

10 cl, 2 dwg

Description

FIELD OF TECHNOLOGY

[1] This technical solution relates generally to the field of computer technology, and in particular to methods for the automatic development of credit scoring models and their automatic implementation in the lending process.

BACKGROUND

[2] Financial institutions currently apply standard statistical approaches to the analysis of historical data in order to describe prospective customers in terms of risk. This makes it possible to classify borrowers as “good” or “bad” and thus make the final lending decision. Most credit institutions have set up units that develop credit scoring models from their own statistics, taking into account the specifics of their client profile. However, these institutions often turn to credit bureaus, which considerably slows the assessment of a borrower's creditworthiness and makes it inaccurate, because the result depends on the algorithms the credit bureaus use.

SUMMARY OF THE INVENTION

[3] This technical solution is aimed at eliminating the disadvantages inherent in existing solutions known from the prior art.

[4] The technical problem (or technical task) addressed by this solution is the automatic development of credit scoring models, followed by their implementation in the decision-making system and their monitoring.

[5] The technical result achieved in solving the above problem is an improvement in the quality of the credit scoring models created.

[6] An additional technical result achieved in solving the technical problem is an increase in the speed of developing credit scoring models.

[7] The solution also reduces the resources needed to develop and maintain models, increases the speed and simplicity of deploying models into the production environment, and provides monitoring of model operation and a prompt response to changes.

[8] This technical result is achieved by a method for developing and managing scoring models in which: data containing factors that affect the scoring model are obtained for a specified period of time; the obtained data are partitioned into samples for development, validation and testing of the scoring model; the factors are transformed by establishing relationships between groups of values of the transformed factor and default rates; at least one transformed factor that correlates with at least one other factor is excluded from the samples; a credit scoring model is formed by training a binary multiple logistic regression; and cut-off zones are selected automatically for at least one scoring model so that it can be installed in the credit procedure.

[9] In some embodiments, the data for the specified period of time are obtained from a user's mobile communication device.

[10] In some embodiments, partitioning the obtained data into samples yields either parts of the original population that do not overlap in time or random subsamples.

[11] In some embodiments, the factors affecting the scoring model are annual income and/or the amount of outstanding debt, and/or real-estate ownership, and/or car ownership, and/or length of employment at the most recent job, and/or age.

[12] In some embodiments, the factors affecting the scoring model are discrete or continuous.

[13] In some embodiments, when the factors are transformed, the degree to which the default rate in a data group deviates from the average default rate across the entire sample is determined.

[14] In some embodiments, when the factors are transformed, the algorithm that splits factor values into groups is rerun with a new set of settings for the factors that ended up on the excluded list.

[15] In some embodiments, when transformed factors are excluded from the samples, a table of pairwise correlation coefficients of the transformed factors is built.

[16] In some embodiments, when transformed factors are excluded from the samples, the factor with the largest number of factors correlated with it is selected in a loop.

[17] In some embodiments, when the credit scoring model is formed, a logistic model is built using stepwise regression to select the final set of factors.
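The correlation-based exclusion described in [15] and [16] can be illustrated with a minimal Python sketch. The text does not state the correlation threshold or what happens to the selected factor, so both are assumptions here: the threshold of 0.7 is hypothetical, and the factor with the most correlated partners is assumed to be the one dropped, with the loop repeating until no correlated pair remains. The function and factor names are illustrative only.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equally long numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # a constant factor is treated as uncorrelated
    return cov / sqrt(vx * vy)

def correlation_table(factors):
    """Pairwise correlation coefficients, keyed by (name_a, name_b)."""
    names = list(factors)
    return {(a, b): pearson(factors[a], factors[b])
            for i, a in enumerate(names) for b in names[i + 1:]}

def drop_correlated(factors, threshold=0.7):
    """Assumed rule: repeatedly drop the factor with the most correlated partners."""
    kept = dict(factors)
    dropped = []
    while kept:
        counts = {name: 0 for name in kept}
        for (a, b), r in correlation_table(kept).items():
            if abs(r) > threshold:  # threshold is a hypothetical setting
                counts[a] += 1
                counts[b] += 1
        worst = max(counts, key=counts.get)
        if counts[worst] == 0:
            break  # no correlated pairs are left
        dropped.append(worst)
        del kept[worst]
    return kept, dropped

# Illustrative WOE-transformed factors (made-up numbers).
woe_factors = {
    "income_woe": [0.1, 0.4, 0.5, 0.9, 1.2],
    "debt_woe":   [0.2, 0.8, 1.0, 1.8, 2.4],   # twice income_woe
    "salary_woe": [0.3, 0.6, 0.7, 1.1, 1.4],   # income_woe shifted by 0.2
    "age_woe":    [0.5, -0.3, 0.4, -0.2, 0.1],
}
kept, dropped = drop_correlated(woe_factors)
```

Ties are broken by dictionary order in this sketch; a production rule would more likely prefer the factor with the weaker ranking power, which the text does not specify.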

BRIEF DESCRIPTION OF THE DRAWINGS

[18] The features and advantages of the present invention will become apparent from the following detailed description of the invention and the accompanying drawings, in which:

[19] FIG. 1 shows an example implementation of the method for developing and managing scoring models, in the form of a flowchart.

[20] FIG. 2 shows a high-level example diagram of the method for developing and managing scoring models. The core consists of two blocks: retraining and selection/adjustment of cut-off zones. Without adapting the cut-off zones it is impossible to organize automatic deployment of the model into the decision-making system. The results of these two blocks are integrated into the production environment (SAS RTDM in this embodiment). In addition, each of the two blocks is subject to routine checks: daily monitoring of the target indicator that depends on the cut-off zones (the approval rate) and monthly model validation.

DETAILED DESCRIPTION OF THE INVENTION

[21] This technical solution can be implemented on a computer, as an automated system (AS), or as a machine-readable medium containing instructions for performing the above method.

[22] The technical solution can be implemented as a distributed computer system.

[23] In this solution, a system means a computer system, a computer (electronic computing machine), CNC (computer numerical control), a PLC (programmable logic controller), computerized control systems, and any other devices capable of performing a given, well-defined sequence of computing operations (actions, instructions).

[24] A command processing device means an electronic unit or an integrated circuit (microprocessor) that executes machine instructions (programs).

[25] The command processing device reads and executes machine instructions (programs) from one or more data storage devices. Data storage devices may include, but are not limited to, hard disk drives (HDD), flash memory, ROM (read-only memory), solid-state drives (SSD), and optical drives.

[26] A program is a sequence of instructions intended for execution by the control unit of a computer or by a command processing device.

[27] The terms and concepts necessary for implementing this technical solution are described below.

[28] Credit scoring - a method of modeling a borrower's credit risk based on numerical statistical methods. The purpose of credit scoring is to support decisions on granting loans to individuals or legal entities.

[29] P-value - a quantity used in testing statistical hypotheses. In effect, it is the probability of an error when rejecting the null hypothesis (a type I error).

[30] Representativeness - the correspondence of the characteristics of a sample to the characteristics of the population as a whole. Representativeness determines the extent to which results obtained on a particular sample can be generalized to the entire population.

[31] DR - default rate. It is calculated as the number of default observations in a group divided by the number of all observations in the group.

[32] Bootstrap - a practical computer-based method for studying the distribution of statistics of probability distributions, based on repeatedly generating samples from the available sample.

[33] Probability of default - the probability that a deal defaults within one year from the date the rating was assigned or adjusted.

[34] Sample - a set of deals and their parameters that meet specified characteristics and form part of the analyzed population.

[35] Sample for training - a set of deals and their parameters used to estimate the model.

[36] Sample for assessing stability - a set of deals and their parameters used to assess the stability of the ranking power of factors and of their groupings.

[37] Sample for testing - data on all available contracts for all available reporting dates. It is defined with respect to the segment on which the model is developed.

[38] Population - the set of deal-date pairs belonging to the selected segment.

[39] Discrete factors - factors with a limited number of possible values.

[40] Continuous factors - factors with an unlimited number of possible values.

[41] Training sample - a set of deals and their parameters used to develop the model.

[42] Factor conversion - replacement of factor values with calculated quantities (scores, WOE) associated with the estimate of the probability of default for that factor value.

[43] Scoring score - the value of an indicator of deal quality in terms of the probability of default.

[44] Test sample - a sample used to verify the effectiveness of the resulting model (it takes no part in development).

[45] Factor transformation - the same as factor conversion.

[46] PD - the value of the probability of default.

[47] WOE (weight of evidence) - a quantity that characterizes the degree to which the default rate in a group deviates from the average default rate across the entire sample.
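Two of the definitions above, the default rate (DR) and the bootstrap, can be made concrete with a short Python sketch. The data, the number of resamples and the percentile interval are illustrative assumptions and are not taken from the description.

```python
import random

def default_rate(flags):
    """DR: number of default observations divided by all observations."""
    return sum(flags) / len(flags)

def bootstrap_default_rates(flags, n_resamples=1000, seed=0):
    """Resample the observed flags with replacement and recompute DR each time."""
    rng = random.Random(seed)
    n = len(flags)
    return [default_rate(rng.choices(flags, k=n)) for _ in range(n_resamples)]

# Made-up group of ten deals: 1 = default, 0 = no default.
flags = [1, 0, 0, 0, 1, 0, 0, 0, 0, 0]
dr = default_rate(flags)

# Rough 95% percentile interval for DR from the bootstrap distribution.
rates = sorted(bootstrap_default_rates(flags))
lower, upper = rates[25], rates[974]
```

The spread between `lower` and `upper` gives a sense of how uncertain a default rate is when it is estimated on a small group.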

[48] The computerized method for developing and managing scoring models, shown schematically in FIG. 1, includes the following steps.

[49] Step 101: data containing factors that affect the scoring model are obtained for a specified period of time.

[50] User data may include, without limitation, the current state of accounts (including closed ones): opening dates, current balances, term, currency, product type and name, number of extensions, current status, and so on.

[51] The obtained data may also include monthly balances (as of the end of each month) for each account over a recent period (for example, the last six months), and all transactions over the same period with their amount, type and subtype, and a debit/credit flag.

[52] The above data, which constitute a sample, can be obtained from the user's mobile communication device, such as a tablet, mobile phone or smartphone, or from the automated system of the financial and credit institution in which the data are stored.

[53] Based on the obtained user data, the credit score is determined automatically, i.e. non-repayment of a loan issued to the user is predicted. A training sample is used for this: a set of objects (users), each characterized by a set of attributes (such as age, salary, loan type, state of accounts, monthly balances, past non-repayments, etc.) and by a target attribute. The target attribute may be, for example, loan delinquency. If the target attribute is simply the fact of non-repayment (it takes the value 1 or 0, i.e. the institution knows which of its customers repaid the loan and which did not), this is a (binary) classification task. If it is known how long a customer delayed repayment and the same quantity is to be predicted for new customers, this is a regression task.
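The two kinds of target attribute described in [53] can be sketched as follows. The `Borrower` record and its field names are hypothetical and serve only to show how the same training objects yield a binary classification target or a regression target.

```python
from dataclasses import dataclass

@dataclass
class Borrower:
    """Hypothetical training object: a few attributes plus observed outcome."""
    age: int
    salary: float
    repaid: bool      # whether the loan was repaid
    days_late: int    # how long repayment was delayed, in days

def classification_target(b):
    """Binary target: 1 for non-repayment, 0 otherwise."""
    return 0 if b.repaid else 1

def regression_target(b):
    """Numeric target: length of the repayment delay."""
    return b.days_late

# Made-up training sample.
training_sample = [
    Borrower(age=34, salary=60000.0, repaid=True, days_late=0),
    Borrower(age=27, salary=35000.0, repaid=True, days_late=12),
    Borrower(age=45, salary=48000.0, repaid=False, days_late=120),
]
y_class = [classification_target(b) for b in training_sample]
y_reg = [regression_target(b) for b in training_sample]
```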

[54] For each group of accounts (deposits and other accounts), the following data or factors may be taken into account:

[55] Number of accounts;

[56] Number of accounts with the status "Active";

[57] Number of accounts with the status "Closed";

[58] Number of accounts with the status "Account arrested";

[59] "Worst" status across all of the customer's accounts;

[60] Number of accounts in foreign currency;

[61] Number of accounts in precious metals;

[62] Minimum term across accounts;

[63] Average term across accounts;

[64] Maximum term across accounts;

[65] Minimum term across active accounts;

[66] Average term across active accounts;

[67] Maximum term across active accounts;

[68] Contract term, weighted by current balance in rubles;

[69] Total amount of current balances;

[70] Maximum balance across all accounts;

[71] Share of foreign-currency accounts, weighted by current balance;

[72] Share of precious-metal accounts, weighted by current balance;

[73] Time in days since the opening date of the earliest account.

[74] It will be apparent to a person skilled in the art that the above data set is exemplary and may differ in some embodiments.
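A few of the account-level factors listed above can be computed as in the following Python sketch. The `Account` record, its field names, the status strings and the use of "RUB" as the domestic currency code are assumptions made for illustration; only a subset of the listed factors is shown.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Account:
    """Hypothetical account record."""
    status: str         # "active", "closed" or "arrested"
    currency: str       # e.g. "RUB", "USD"
    term_days: int
    balance_rub: float  # current balance converted to rubles
    opened: date

def account_features(accounts, as_of):
    """Aggregate a customer's accounts into factors; assumes a non-empty list."""
    terms = [a.term_days for a in accounts]
    total = sum(a.balance_rub for a in accounts)
    feats = {
        "n_accounts": len(accounts),
        "n_active": sum(a.status == "active" for a in accounts),
        "n_closed": sum(a.status == "closed" for a in accounts),
        "n_foreign_currency": sum(a.currency != "RUB" for a in accounts),
        "min_term": min(terms),
        "avg_term": sum(terms) / len(terms),
        "max_term": max(terms),
        "total_balance": total,
        "max_balance": max(a.balance_rub for a in accounts),
        "days_since_first_open": (as_of - min(a.opened for a in accounts)).days,
    }
    if total > 0:  # balance-weighted factors are undefined for a zero total
        feats["balance_weighted_term"] = (
            sum(a.term_days * a.balance_rub for a in accounts) / total)
        feats["balance_weighted_fx_share"] = (
            sum(a.balance_rub for a in accounts if a.currency != "RUB") / total)
    return feats

# Made-up customer with three accounts.
accounts = [
    Account("active", "RUB", 365, 100000.0, date(2015, 1, 10)),
    Account("active", "USD", 730, 300000.0, date(2016, 6, 1)),
    Account("closed", "RUB", 180, 0.0, date(2014, 3, 15)),
]
feats = account_features(accounts, as_of=date(2018, 4, 4))
```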

[75] Next, at least one sample is formed for developing the scoring model. The most recent loans issued within one calendar year that have been in the portfolio for at least 12 months are used for this. Because scoring models are developed to predict the behavior of all borrowers, developing a model solely on applications that resulted in an issued loan can lead to inaccurate results: the model would be trained on a biased sample. Applications rejected by the previous scoring model are therefore analyzed as well. To account for these rejections, a certain percentage of the worst applications rejected by the previous model is added to the development sample. All such applications are treated as defaults.
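The way rejected applications are added to the development sample in [75] can be sketched as follows. The description gives neither the percentage nor the criterion for "worst", so this sketch assumes a hypothetical share of 20% and assumes that a lower score from the previous model means a worse application. All field names are illustrative.

```python
def build_development_sample(issued, rejected, worst_share=0.2):
    """Combine issued loans with the worst rejected applications.

    issued:   dicts with an observed 'default' flag (1 or 0)
    rejected: dicts with 'prev_score' from the previous model (assumed: lower is worse)
    """
    n_add = int(round(len(rejected) * worst_share))  # hypothetical share
    worst = sorted(rejected, key=lambda a: a["prev_score"])[:n_add]
    sample = [dict(a) for a in issued]
    # Rejected applications have no observed outcome; they are treated as defaults.
    sample += [{**a, "default": 1} for a in worst]
    return sample

# Made-up data.
issued = [
    {"id": 1, "default": 0},
    {"id": 2, "default": 1},
    {"id": 3, "default": 0},
]
rejected = [{"id": 100 + i, "prev_score": s}
            for i, s in enumerate([410, 250, 390, 300, 120, 350, 280, 330, 180, 400])]
sample = build_development_sample(issued, rejected)
added_ids = [a["id"] for a in sample if a["id"] >= 100]
```

With ten rejected applications and a 20% share, the two lowest-scored ones (scores 120 and 180) are appended with `default` set to 1.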

[76] Step 102: the obtained data are partitioned into samples for training, validation and testing of the scoring model.

[77] At this stage, the original data set is split into training, validation and test samples in a specified ratio. The training sample is subsequently used at all stages of the process, the validation sample is used to select the most stable factors and for the final check of scoring model quality, and the test sample is used for comprehensive independent testing. In some embodiments, the samples can be formed as consecutive parts of the original population that do not overlap in time, or as random subsamples.
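Both partitioning variants from [77] can be sketched in a few lines of Python. The 60/20/20 ratio, the record layout and the function names are assumptions; the description only says that the ratio is specified.

```python
import random

def _cut(items, ratios):
    """Cut an ordered list into train, validation and test parts."""
    n = len(items)
    n_train = int(round(n * ratios[0]))
    n_valid = int(round(n * ratios[1]))
    return (items[:n_train],
            items[n_train:n_train + n_valid],
            items[n_train + n_valid:])

def split_by_time(records, ratios=(0.6, 0.2, 0.2), key="date"):
    """Consecutive parts of the population that do not overlap in time."""
    return _cut(sorted(records, key=lambda r: r[key]), ratios)

def split_random(records, ratios=(0.6, 0.2, 0.2), seed=0):
    """Random subsamples."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    return _cut(shuffled, ratios)

# Made-up records with ISO dates, deliberately out of order.
records = [{"id": i, "date": f"2017-{m:02d}-01"}
           for i, m in enumerate([5, 1, 9, 3, 10, 2, 8, 4, 7, 6])]

train, valid, test = split_by_time(records)
r_train, r_valid, r_test = split_random(records)
```

A time-ordered split checks the model on loans issued later than those it was trained on, which is closer to how the model is used after deployment; a random split keeps the three samples statistically similar.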

[78] Step 103: the factors are transformed by establishing relationships between groups of values of the transformed factor and default rates.

[79] Factors used as input parameters of scoring models and potentially related to the user's creditworthiness may include, without limitation, annual income, the amount of outstanding debt, ownership of real estate or a car, length of employment at the most recent job, age, and the like.

[80] Most of the factors that describe loan application data are usually discrete (education, gender, marital status, purpose of the loan, type of home ownership, occupation, etc.). Some of these factors admit a certain ordering (for education, for example, a higher level can be assigned a larger value of the variable), while others have no meaningful linear order at all (for example, marital status or purpose of the loan). Such variables cannot be treated as continuous even approximately, because their values are merely the numbers of the answers to the corresponding questions, and those answers can be listed in any order. If the scoring model in use requires continuous variables, the discreteness can be worked around by replacing each such variable with a larger number of variables taking values from 0 to 1.
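The replacement of one unordered discrete variable by several 0/1 variables, as mentioned at the end of [80], is commonly done with indicator (one-hot) columns. The sketch below assumes that reading; the column naming scheme is illustrative.

```python
def one_hot(values):
    """Replace one discrete variable with one 0/1 indicator column per category."""
    categories = sorted(set(values))
    return {f"is_{c}": [1 if v == c else 0 for v in values] for c in categories}

# Made-up marital status values for four applications.
marital_status = ["single", "married", "divorced", "married"]
columns = one_hot(marital_status)
```

Each application has exactly one indicator set to 1, so no artificial ordering between categories is introduced.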

[81] Трансформация каждого рассматриваемого фактора заключается в замене его значений расчетной величиной - WOE.[81] The transformation of each factor under consideration consists in replacing its values with a calculated value - WOE.

[82] WOE (weight of evidence) characterizes how far the default rate of a group of observations deviates from the average default rate of the whole sample. Each factor is thus replaced by its corresponding WOE factor as follows:

[83] where f is the factor under consideration, i is the index of a group of values of the factor f, and WOE_i(f) is the WOE value corresponding to group i.

[84] In some embodiments the WOE can take any value. Positive WOE values indicate that the segment in question has a lower default rate than the sample as a whole (the larger the WOE, the lower the default rate). A WOE value below zero indicates that the segment has a higher default rate than the sample as a whole. The WOE value for group i can be determined as follows:
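The formula itself is not reproduced in this text. The standard definition of weight of evidence, which agrees with the sign convention stated in [84] (a positive WOE means a lower default rate than the sample) and uses the symbols defined in [85], would read as follows; it is given here as an assumption, not as the exact expression of the source.

```latex
\mathrm{WOE}_i = \ln\!\left(\frac{N_G(i)/N_G}{N_B(i)/N_B}\right)
```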

[85] where N_G(i) and N_G are the numbers of non-default observations in group i and in the whole sample, respectively, and N_B(i) and N_B are the numbers of default observations in group i and in the whole sample, respectively.
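A minimal sketch of the calculation for grouped data, assuming the common definition WOE_i = ln((N_G(i)/N_G) / (N_B(i)/N_B)). The function name `woe_by_group` and the input layout are hypothetical. Groups with a zero count are returned as `None`, because the text treats that case with a separate formula that is not available here.

```python
import math


def woe_by_group(groups, defaults):
    """groups[k] is the group label of observation k; defaults[k] is 1 for a
    default and 0 for a non-default (illustrative input layout)."""
    n_b = sum(defaults)                 # N_B: defaults in the whole sample
    n_g = len(defaults) - n_b           # N_G: non-defaults in the whole sample
    good, bad = {}, {}
    for g, d in zip(groups, defaults):
        good[g] = good.get(g, 0) + (1 - d)   # N_G(i)
        bad[g] = bad.get(g, 0) + d           # N_B(i)
    woe = {}
    for g in good:
        if good[g] == 0 or bad[g] == 0:
            woe[g] = None  # zero-count case: separate formula, not reproduced
        else:
            woe[g] = math.log((good[g] / n_g) / (bad[g] / n_b))
    return woe


# Group "a" has 1 default in 4, group "b" has 3 defaults in 4.
woe = woe_by_group(["a"] * 4 + ["b"] * 4, [0, 0, 0, 1, 0, 1, 1, 1])
```

Here group "a" is safer than the sample as a whole and gets a positive WOE, while group "b" gets a negative one.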

[86] If N_G(i) = 0 or N_B(i) = 0, the WOE value for the group is determined by the formula:

[87] For continuous factors, grouping is carried out so that each range contains observations with a comparable default rate (DR). As a result of grouping, a continuous factor is divided into several ranges, for each of which the default rate can be estimated from the observations that fall into it.

[88] Variables with a discrete set of values are grouped in the same way as continuous factors, on the basis of a comparable default rate (DR). Each group may contain one or several values of the factor. The default rate is calculated over all observations in the group.

[89] The use of WOE factors has the following advantages:

[90] Linearization of the factors in line with the assumptions of logistic regression.

[91] Automatic handling of missing values: they are either merged with the group closest to them in default rate or form a separate group. If a missing value cannot be interpreted or does not occur in the sample, it is assigned to the group with the worst risk level.

[92] Automatic handling of anomalous values: they cannot distort the model, because their actual value is not used in it. They enter the model as members of one of the extreme groups, which is characterized by its WOE value, based solely on the ratio of default to non-default observations in the group.

[93] The ability to assess and control whether the direction of the relationship between factor values and the default rate is logical (business logic), which ensures that the final scoring points make sense (for example, older people usually score more points than younger ones). Logical relationships are consistent with business experience and therefore yield a more stable model.

[94] A reduced risk of overfitting. The model does not absorb every random fluctuation in the data, as it would with ungrouped attributes. Such a model is more flexible and can withstand some changes in the population, which keeps it stable over a longer period of time.

[95] The initial grouping of factor values can be performed with single-factor decision trees. Compared with manual grouping, this increases the discriminating power of the resulting factors, because the resulting groups are as homogeneous as possible internally and as different as possible from one another according to the statistical criterion used.

[96] The discriminating power of a factor is its ability to separate default observations from non-default ones. The Gini index can be used to assess the discriminating power of a variable.
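The text does not give a formula for the Gini index. A definition widely used in credit scoring is Gini = 2·AUC − 1, where AUC is the probability that a randomly chosen default observation receives a higher risk score than a randomly chosen non-default one. The sketch below assumes that definition and a score that grows with risk (for a WOE factor, the negated WOE could serve as such a score); the function name `gini` is hypothetical.

```python
def gini(scores, defaults):
    """Gini = 2 * AUC - 1 for a score where higher means riskier.

    Assumption: this common definition is used; ties count as half a win.
    """
    bad = [s for s, d in zip(scores, defaults) if d == 1]
    good = [s for s, d in zip(scores, defaults) if d == 0]
    wins = sum((b > g) + 0.5 * (b == g) for b in bad for g in good)
    auc = wins / (len(bad) * len(good))
    return 2 * auc - 1


perfect = gini([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0])      # fully separating score
useless = gini([0.5, 0.5, 0.5, 0.5], [1, 0, 1, 0])      # constant score
```

A score that separates the two classes completely gives 1, and a constant score gives 0.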

[97] Established practice regarding the interpretability of the factors used in scoring requires attention not only to the ranking power of WoE-transformed factors but also to their business logic. For this reason, this stage includes not only the automatic splitting of factor values and the calculation of WoE for them, but also a check of the resulting splits against business logic. If a split fails this check, the algorithm tries to obtain a new split using alternative settings. The method for obtaining the final WoE factors includes the steps below.

[98] First, the splitting of factor values is run with the specified set of settings.

[99] Then the resulting groups are merged by proximity of their WoE values whenever the WoE distance between groups does not exceed a specified threshold. For interval factors, the order of the groups, sorted by factor value, is also taken into account. Factors left with only one group after merging are moved to the excluded list.

[100] At the next step, groups smaller than a specified threshold size are merged with the group closest to them in WoE. For interval factors, the order of the groups, sorted by factor value, is also taken into account. Factors left with only one group after merging are moved to the excluded list. After each merge, the method returns to the second step.

[101] The resulting groups must also be merged by proximity of their WoE values if more groups have been formed than the maximum number initially specified for the given predictor. For interval factors, the order of the groups, sorted by factor value, is also taken into account. Factors left with only one group after merging are moved to the excluded list. After each merge, the method returns to the second step.
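The merging by WoE proximity in [99] and [101] can be sketched for an interval factor as follows. This is a simplified illustration under several assumptions: only adjacent groups are merged (to respect the order by factor value), every group contains at least one default and one non-default observation, the WoE uses the common definition ln((N_G(i)/N_G) / (N_B(i)/N_B)), and the small-group rule of [100] is omitted. The names `merge_adjacent`, `woe_threshold` and `max_groups` are hypothetical.

```python
import math


def merge_adjacent(groups, woe_threshold, max_groups):
    """groups: list of (good, bad) counts ordered by factor value, all > 0.

    Repeatedly merges the adjacent pair with the closest WoE while that gap
    does not exceed woe_threshold or there are more than max_groups groups.
    """
    groups = list(groups)
    n_g = sum(g for g, _ in groups)
    n_b = sum(b for _, b in groups)
    while len(groups) > 1:
        w = [math.log((g / n_g) / (b / n_b)) for g, b in groups]
        gaps = [abs(w[k + 1] - w[k]) for k in range(len(w) - 1)]
        k = min(range(len(gaps)), key=gaps.__getitem__)
        if gaps[k] > woe_threshold and len(groups) <= max_groups:
            break  # nothing left to merge under either rule
        groups[k:k + 2] = [(groups[k][0] + groups[k + 1][0],
                            groups[k][1] + groups[k + 1][1])]
    return groups


bins = [(100, 10), (95, 10), (50, 30), (20, 40)]
merged = merge_adjacent(bins, woe_threshold=0.1, max_groups=5)
```

The WoE values are recomputed after every merge, which mirrors the instruction to return to the second step. A caller would move the factor to the excluded list if the returned list contains a single group.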

[102] In some embodiments, monotonicity, non-monotonicity conditions and the direction of risk are checked for interval variables against the reference directory. Factors that do not satisfy the conditions in the directory are moved to the excluded list.

[103] In some embodiments the minimum permissible number of groups is checked. If a variable has fewer groups than the initially specified minimum, it is moved to the excluded list.

[104] In some embodiments, conditions on the relationship of risk between groups are checked for categorical and binary variables against the reference directory (business-logic check). The conditions are specified in a special language that can describe patterns of risk relationships between groups of any complexity. Factors that do not satisfy the conditions in the directory are moved to the excluded list.

[105] In some embodiments the drop in the Gini coefficient is checked. If the Gini coefficient of a predictor on the validation sample is below an initially specified threshold, or falls relative to the Gini coefficient on the training sample by more than a specified percentage, the factor is moved to the excluded list.
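The check in [105] reduces to two comparisons. The sketch below assumes that the drop is measured relative to the training Gini and is expressed as a fraction rather than a percentage; the function and parameter names are hypothetical.

```python
def passes_gini_check(gini_train, gini_valid, min_gini, max_rel_drop):
    """Return True if the factor stays, False if it goes to the excluded list.

    Assumptions: gini_train > 0; max_rel_drop is a fraction (0.10 = 10%).
    """
    if gini_valid < min_gini:
        return False  # too weak on the validation sample
    rel_drop = (gini_train - gini_valid) / gini_train
    return rel_drop <= max_rel_drop


keeps = passes_gini_check(0.40, 0.38, min_gini=0.05, max_rel_drop=0.10)
drops = passes_gini_check(0.40, 0.30, min_gini=0.05, max_rel_drop=0.10)
```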

[106] In some embodiments the stability of the order of the groups, sorted by WoE, is checked. The training sample is compared with 20 samples drawn at random from the union of the training and validation samples. Factors whose WoE-ordered group sequence proves unstable are moved to the excluded list.
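A sketch of the stability check in [106] for one factor. Under the common WOE definition, a group's WoE rises as its default rate falls, so the WoE order of the groups can be obtained by sorting on the default rate, which avoids logarithms of zero in small samples. The sample size (equal to the training sample) and sampling without replacement are assumptions, and the names `order_by_woe` and `order_is_stable` are hypothetical.

```python
import random


def order_by_woe(rows):
    """rows: (group, default) pairs. Returns groups from lowest to highest WoE,
    i.e. from highest to lowest default rate."""
    stats = {}
    for g, d in rows:
        n, bad = stats.get(g, (0, 0))
        stats[g] = (n + 1, bad + d)
    return sorted(stats, key=lambda g: -stats[g][1] / stats[g][0])


def order_is_stable(train, valid, n_samples=20, seed=0):
    """Compare the group order on train with the order on n_samples random
    samples from the pooled data (assumed size: len(train))."""
    rng = random.Random(seed)
    pooled = train + valid
    reference = order_by_woe(train)
    return all(order_by_woe(rng.sample(pooled, len(train))) == reference
               for _ in range(n_samples))


train = [("a", 1)] * 50 + [("b", 0)] * 50
stable = order_is_stable(train, list(train))
```

A factor for which `order_is_stable` returns `False` would be moved to the excluded list.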

[107] For the factors on the excluded list, the algorithm for splitting factor values must be run again with a new set of settings. If no settings are available or all of them have already been tried, the formation of splits is considered complete. The number of settings is determined by the capabilities of the statistical package used, for example SAS Enterprise Miner. The WoE factors are thus formed from the results of the algorithm. Original factors that failed the checks under every set of split settings are excluded from the process.

[108] Step 104: at least one transformed factor that correlates with at least one other factor is excluded from the samples.

[109] Pairwise correlation analysis is used to detect collinear dependencies between variables. Correlation between factors inflates the standard deviations of the regression coefficients, which reduces their stability and reliability in multivariate analysis. For the correlation analysis, a correlation matrix is calculated: a table of the pairwise correlation coefficients of the transformed WOE factors. Analysis of this table identifies the variables that have strong linear relationships with other factors. The value from which correlation coefficients are considered high is set in the reference directory; the recommended value lies in the range from 0.5 to 1 in absolute value. From each pair of correlated factors only one should be kept, chosen either for its higher individual predictive power or for its greater importance in terms of business logic. The system uses the following algorithm: in a loop, the factor with the largest number of factors correlated with it (correlation above the chosen threshold) is selected. If there are several such factors, the one with the lowest Gini index is chosen among them. That factor is excluded from consideration. Then the next factor with the largest number of remaining correlated factors and the lowest Gini index is selected. As a result, the factors left at the end of the loop have no correlations above the chosen threshold. This approach yields the largest number of uncorrelated factors in the final list of factors for modeling.
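The elimination loop described in [109] can be sketched as follows. The input layout (a nested dictionary of pairwise correlations and a dictionary of Gini values) and the name `drop_correlated` are assumptions made for the example.

```python
def drop_correlated(corr, gini, threshold):
    """corr[a][b]: pairwise correlation of WOE factors a and b (symmetric);
    gini[a]: Gini index of factor a. Returns the factors that remain."""
    remaining = set(gini)
    while True:
        # For each factor, count the remaining factors correlated with it.
        counts = {a: sum(1 for b in remaining
                         if b != a and abs(corr[a][b]) >= threshold)
                  for a in remaining}
        worst = max(counts.values(), default=0)
        if worst == 0:
            return sorted(remaining)   # no correlated pairs are left
        candidates = [a for a in remaining if counts[a] == worst]
        # Among the most-correlated factors, drop the one with the lowest Gini.
        remaining.remove(min(candidates, key=lambda a: gini[a]))


corr = {
    "A": {"B": 0.9, "C": 0.8, "D": 0.1},
    "B": {"A": 0.9, "C": 0.2, "D": 0.1},
    "C": {"A": 0.8, "B": 0.2, "D": 0.1},
    "D": {"A": 0.1, "B": 0.1, "C": 0.1},
}
kept = drop_correlated(corr, {"A": 0.5, "B": 0.3, "C": 0.2, "D": 0.1}, 0.5)
```

In this example factor A is correlated with both B and C, so removing A alone resolves every conflict and three factors survive, even though A has the highest Gini.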

[110] Step 105: a credit scoring model is formed by training a binary multiple logistic regression of the following form:
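The formula is not reproduced in this text. The standard form of a binary logistic regression, written with the symbols defined in the next paragraph, would be as follows; it is given as an assumption about the omitted expression.

```latex
PD = P(Y = 1 \mid X_1, \dots, X_n)
   = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n)}}
```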

where Y is the dependent variable (default indicator), Y = 1 is the default event, X₁, X₂, …, X_n is the set of independent explanatory WOE factors, β₀, β₁, β₂, …, β_n are the logistic regression coefficients, and PD is the probability of default.

[111] The values of the probability of default (PD) lie in the interval [0, 1]. PD shows the probability of default for each calculated rating.

[112] In some embodiments the probability of default may be expressed on a scale from 0 to 100, as a percentage or as a number.

[113] Although the correlated pairs were removed at the previous step, multicollinearity may still arise among the factors of the scoring model, so its absence must be checked when the model is built. In addition, because the scoring model is built on WoE factors, and a larger WoE means lower risk, the sign of each coefficient in the scoring model must be checked (all regression coefficients must be negative). High model stability is also required, so the significance of each factor in the model is checked with a statistical bootstrap procedure: each factor must be significant by the Wald statistic in at least 85% of cases. The final scoring model is formed as follows.

[114] From all the factors that have reached this stage, a logistic model is built using stepwise regression to select the final set of factors.

[115] For these factors the variance inflation factor (VIF) is calculated. To determine the VIF, a linear regression model is estimated in which the factor under consideration is the dependent variable and the remaining factors included in the model are the independent variables. The resulting VIF for the factor is given by the formula:
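The formula is not reproduced in this text. The standard definition of the variance inflation factor, in terms of the coefficient of determination R² of the auxiliary regression just described, is:

```latex
\mathrm{VIF} = \frac{1}{1 - R^{2}}
```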

where R² is the coefficient of determination of the model described above. The variable whose VIF exceeds the specified value and whose Gini coefficient is the lowest is excluded. The first and second steps are repeated until every factor included in the model has a VIF below the specified value.

[116] Next, the model is checked for factors with a positive regression coefficient. If any are found, the one with the lowest Gini coefficient is excluded and the method returns to the first step. If there are none, the method proceeds to the next step.

[117] Next, the training and validation samples are combined. Several dozen samples of the same size as the training sample are drawn at random from their union. The scoring model with the current set of factors is trained on each of these samples. If some factors are significant by the Wald statistic in fewer than 85% of cases, the one with the lowest Gini coefficient is excluded. After the exclusion the method returns to the first step. If there are no such factors, the scoring model is considered successfully built.
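The resampling part of [117] can be sketched independently of any particular regression library. In the sketch below, `significant_in` is a hypothetical callback supplied by the caller: it fits the scoring model on a sample and returns the set of factors that are significant by the Wald statistic. Drawing with replacement is an assumption suggested by the term "bootstrap"; the text does not state how the samples are drawn.

```python
import random


def unstable_factors(rows, factors, sample_size, significant_in,
                     n_samples=30, min_share=0.85, seed=0):
    """Return the factors significant in fewer than min_share of the samples.

    significant_in(sample) -> set of factor names (hypothetical callback that
    fits the model on `sample` and applies the Wald test).
    """
    rng = random.Random(seed)
    hits = {f: 0 for f in factors}
    for _ in range(n_samples):
        sample = rng.choices(rows, k=sample_size)  # assumption: with replacement
        for f in significant_in(sample):
            if f in hits:
                hits[f] += 1
    return [f for f in factors if hits[f] / n_samples < min_share]


# Toy callback: pretend only x1 is ever significant.
flagged = unstable_factors(list(range(100)), ["x1", "x2"], 50,
                           significant_in=lambda sample: {"x1"})
```

Of the flagged factors, the caller would exclude the one with the lowest Gini coefficient and then rebuild the model from the first step.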

[118] The algorithm thus makes it possible to develop, automatically, scoring models that meet all reasonable quality requirements. In addition, it guarantees that every factor complies with the business logic described in the special reference directory.

[119] In some embodiments the model is validated automatically using any statistical model validation methodology known in the art. At this stage quantitative tests are calculated to assess model quality. Validation uses the test sample formed at step 102 and the general population of data. If validation is passed, the method proceeds to the next step; otherwise the system user receives a notification that validation has failed, together with a detailed report on the deficiencies found. By varying the algorithm settings, the user can adjust the modeling approach so that the next validation succeeds.

[120] The choice of the optimal cut-off threshold depends on the cost of type I and type II errors in classification. The model should classify “bad” borrowers more accurately, because in credit scoring the cost of a type I error is higher. Lowering the cut-off threshold increases the sensitivity of the model, i.e. its ability to correctly identify borrowers who will fall into arrears. The balance point between sensitivity and specificity can be taken as the optimal cut-off threshold.
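One way to locate the balance point mentioned in [120] is to scan the candidate thresholds and keep the one where sensitivity and specificity are closest. The sketch assumes that a borrower is classified as “bad” when the predicted PD is at or above the threshold, and that the sample contains both defaults and non-defaults; the function name `balance_cutoff` is hypothetical.

```python
def balance_cutoff(pd_values, defaults):
    """Return the PD threshold at which |sensitivity - specificity| is smallest.

    Assumption: predicted "bad" means PD >= threshold.
    """
    n_bad = sum(defaults)
    n_good = len(defaults) - n_bad
    best = None
    for t in sorted(set(pd_values)):
        tp = sum(1 for p, d in zip(pd_values, defaults) if d == 1 and p >= t)
        tn = sum(1 for p, d in zip(pd_values, defaults) if d == 0 and p < t)
        gap = abs(tp / n_bad - tn / n_good)   # sensitivity vs specificity
        if best is None or gap < best[0]:
            best = (gap, t)
    return best[1]


cutoff = balance_cutoff([0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9],
                        [0, 0, 0, 1, 0, 1, 1, 1])
```

If type I errors are costlier, as the paragraph notes, a threshold somewhat below this balance point trades specificity for higher sensitivity.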

[121] Step 106: cut-off zones are selected automatically for at least one scoring model so that it can be installed in the credit procedure.

[122] Next, cut-off zones on the scoring points are selected automatically for the scoring models so that they can be installed in the credit procedure. The algorithm for selecting cut-off zones consists of two parts, outer and inner. The outer part iterates over cut-off levels; the inner part calculates the expected approval rate of loan applications for the current set of cut-offs. The criterion used by the inner part need not be the approval rate: any indicator of interest that depends on the cut-off levels can be used, for example the risk level or the NPV of the portfolio. The algorithm runs on a historical sample of loan application data. Because the approval rate is seasonal within the week, this technical solution considers the target approval rate only over seven-day periods; otherwise it would have to be defined separately for each day of the week. Accordingly, the number of days of application history considered must be a multiple of seven. Suppose that the credit decision process uses a combination of three models:

1. Credit history quality, or credit bureau scoring (BKI scoring);

2. Application form data (application scoring);

3. Propensity to fraud, or FDC scoring (Fraud Detection Card Scoring).

[123] Suppose we have a combination of cut-off scores for the application, FDC and BKI scoring models. Let (t1, t2, t3) be the adjustment values for the cut-offs of the corresponding models, and (n1, n2, n3) the number of consecutive repetitions of the adjustment for each of them. The outer algorithm for selecting cut-off scores is then as follows. For each scoring model in turn, the following actions are performed:

1. add the corresponding adjustment t from (t1, t2, t3) to the cut-off level of this model;

2. run the inner part of the algorithm, described below, to calculate the expected approval level;

3. if the deviation of the expected approval level has changed direction, choose the combination of cut-off levels for the application, FDC and BKI scoring models at which the deviation of the expected approval level is smallest (in effect, the choice is made between the last two combinations tested);

4. if the deviation of the expected approval level from the target has not changed direction and step 1 has been repeated fewer than n times, with n taken from (n1, n2, n3), go to step 1, i.e. to the adjustment of the next scoring model.

[124] In some embodiments, the above procedure is repeated until the target approval level is obtained or the upper/lower score bound is reached for each of the models.
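The outer part described in [123] and [124] can be sketched as follows. This is a minimal illustration, not the claimed implementation: the function name `select_cutoffs`, the callable `expected_approval` (standing in for the inner part) and the reading that each model's adjustment is repeated up to n times before moving to the next model are assumptions, and the sign of each adjustment t is assumed to be chosen by the caller.

```python
from typing import Callable, Sequence, Tuple

Cutoffs = Tuple[float, ...]


def select_cutoffs(
    cutoffs: Sequence[float],
    adjustments: Sequence[float],   # (t1, t2, t3), signed by the caller
    max_repeats: Sequence[int],     # (n1, n2, n3)
    expected_approval: Callable[[Cutoffs], float],  # hypothetical inner part
    target: float,
) -> Cutoffs:
    """Outer part: adjust each model's cut-off in turn until the deviation flips."""
    current = tuple(cutoffs)
    current_dev = expected_approval(current) - target
    if current_dev == 0:
        return current
    for i, (t, n) in enumerate(zip(adjustments, max_repeats)):
        for _ in range(n):
            # Step 1: add the adjustment to this model's cut-off.
            trial = current[:i] + (current[i] + t,) + current[i + 1:]
            # Step 2: run the inner part for the trial combination.
            trial_dev = expected_approval(trial) - target
            if trial_dev * current_dev <= 0:
                # Step 3: direction changed, keep the closer of the last two.
                return trial if abs(trial_dev) < abs(current_dev) else current
            current, current_dev = trial, trial_dev
    return current  # repetition limits exhausted without crossing the target
```

With a toy inner part such as `lambda c: 1 - sum(c) / 300`, starting cut-offs `(50, 50, 50)`, adjustments `(10, 10, 10)` and a target of 0.41, the search stops as soon as the expected approval level crosses the target and returns whichever of the last two combinations is closer to it.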

[125] The inner part of the algorithm estimates how the approval level changes when the cut-off scores of the scoring models in operation are changed. As noted above, the effect of changing the cut-off zones can be assessed for various indicators, whether risk or profitability, but in any case it is necessary to estimate who will be approved under the new cut-off zones and who will be rejected (or the probability of one of these events). For this reason, the algorithm for estimating the change in the approval level is considered below.

[126] As a rule, the decision-making system (DSS) of a financial and credit institution is a sequence of checks and rule applications, and may include the following stages through which applications pass:

1. rejection on minimum requirements, based on data from the Hunter system, the stop list, etc.;

2. application, BKI and FDC scoring;

3. application of reliability models;

4. underwriting;

5. rejections at subsequent stages.

[127] For this reason, to estimate the approval level when actual scoring rejections are overridden, the decision on each application at every stage following the second stage (scoring) must be known. For any application approved by all scoring models in operation (application, BKI, FDC scoring, etc.), the necessary information about its passage through the subsequent DSS stages is available. For applications rejected by at least one of the models, the outcome of the subsequent stages is uncertain, because such applications never reach those stages. To remove this uncertainty, the algorithm models the rejections after the scoring stage for applications that were previously rejected. The algorithm can be represented as the following sequence of actions.

1. For applications that reached the scoring stage, rejections by the three types of models are simulated with the new cut-off scores. All applications that were actually rejected at the scoring stage but are approved by all models in the simulation are flagged (they require separate modeling of the rejection probability at the subsequent DSS stages).

2. The rejection probability at the stage where the reliability model is applied is modeled. The model is built on applications that successfully passed the scoring procedure before the cut-off scores were changed.

3. The rejection probability at the underwriting stage is modeled. For this, applications rejected at the stage where the reliability models are applied are additionally excluded from the previous sample.

4. The rejection probability at the subsequent stages is modeled. Applications rejected at the underwriting stage are additionally excluded.

5. The rejection probability after passing the scoring procedure is calculated for the flagged applications that require separate modeling (step 1).
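Step 1 of the list above can be sketched as follows. The record layout (`Application`, the order of the three scores) and the rule that a higher score is better with an inclusive cut-off are assumptions made for illustration; the source does not specify them.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class Application:
    scores: Tuple[float, float, float]  # (application, BKI, FDC), assumed order
    rejected_at_scoring: bool           # actual historical outcome


def passes_scoring(scores: Sequence[float], cutoffs: Sequence[float]) -> bool:
    # Assumption: a higher score is better and the cut-off is inclusive.
    return all(s >= c for s, c in zip(scores, cutoffs))


def flag_for_modeling(apps: List[Application], new_cutoffs: Sequence[float]) -> List[bool]:
    """Mark applications actually rejected at scoring but approved in simulation."""
    return [a.rejected_at_scoring and passes_scoring(a.scores, new_cutoffs) for a in apps]
```

Only the flagged applications need the stage-by-stage rejection models of steps 2 to 4; for applications that actually passed scoring, the real outcome of the later stages is already known.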

[128] To determine the rejection probability for applications rejected at the scoring stage, the following formula is used:

where P_blag is the rejection probability of the application under the reliability model; P_underr is the rejection probability of the application at the underwriting stage; P_next is the rejection probability of the application at the subsequent stages.

[129] To obtain the approval probability of an application, the rejection probability is subtracted from one. The approval level is then calculated as the ratio of the number of approved applications (the sum of the approval probabilities) to the total number of applications. Averaging this value over the portfolio under consideration gives the approval level for the selected cut-off zones.
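The calculation in [128] and [129] can be illustrated with a short sketch. The way the three stage probabilities are combined is an assumption: the function below treats each probability as conditional on the application reaching that stage, so an application is approved only if it survives all three. The function names are hypothetical.

```python
from typing import Sequence


def rejection_after_scoring(p_blag: float, p_underr: float, p_next: float) -> float:
    # Assumed combination: an application is approved only if it survives all
    # three stages, each probability being conditional on reaching that stage.
    return 1.0 - (1.0 - p_blag) * (1.0 - p_underr) * (1.0 - p_next)


def approval_level(approval_probs: Sequence[float]) -> float:
    """Sum of approval probabilities divided by the number of applications."""
    return sum(approval_probs) / len(approval_probs)


# 1.0 = actually approved, 0.0 = rejected under the simulated cut-offs;
# the third application is a flagged one with modeled stage probabilities.
probs = [1.0, 0.0, 1.0 - rejection_after_scoring(0.1, 0.2, 0.5)]
level = approval_level(probs)
```

Using probabilities rather than hard decisions for the flagged applications is what lets the "number of approved applications" be computed as a sum of approval probabilities.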

[130] If the risk level is chosen as the target indicator, the obtained value must be multiplied by the risk level produced by the PD model. Averaging this product gives the risk level of the portfolio issued under the selected cut-off zones.
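A literal reading of [130] can be sketched as below: each application's approval probability is multiplied by its PD, and the products are averaged. The function name and the choice to divide by the number of applications (rather than by the sum of approval probabilities) are assumptions that follow the wording of the paragraph.

```python
from typing import Sequence


def portfolio_risk_level(approval_probs: Sequence[float], pds: Sequence[float]) -> float:
    """Average of (approval probability x PD) over all applications."""
    if len(approval_probs) != len(pds):
        raise ValueError("approval_probs and pds must have the same length")
    products = [p * pd for p, pd in zip(approval_probs, pds)]
    return sum(products) / len(products)
```

Applications that are certain to be rejected contribute zero, so tightening the cut-offs lowers this indicator both by removing risky applications and by shrinking the approved share.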

[131] After the automatic selection of cut-off zones, these zones are optimized for the individual segments of the portfolio. The optimization algorithm iteratively calculates optimal decision thresholds for individual customer segments in terms of the "approval level versus risk level" trade-off. The main assumptions that are critical for the results of the algorithm are as follows:

1. The risk level is estimated as the average probability of delinquency within each segment.

2. The probability of delinquency is forecast on the latest available data, taking the segmentation into account.

[132] The main idea of the calculation algorithm is an iterative shift of the cut-off threshold for an individual customer segment, which ultimately raises the overall approval level while keeping the current risk level.

[133] At each iteration, the algorithm considers the customer segment that is optimal in terms of the possible improvement of the AR/DR ratio. Within this segment, "tightening" and "loosening" operations are performed, in that order, with a preset step of 15 points (the step may be increased in accordance with the cycle formation rules, but to no more than 60 points). In this way, the algorithm searches for the optimal neighborhood of the base cut-off threshold that improves the overall AR/DR ratio.
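One iteration of the segment search in [132] and [133] might look like the sketch below. Several points are assumptions: AR and DR are read as the portfolio approval rate and the average PD of approved applications, the step grows in multiples of 15 up to 60 (the actual "cycle formation rules" are not given), and a shift is accepted only if the portfolio AR rises while the DR does not exceed its starting value. All names are hypothetical.

```python
from typing import Dict, List, Tuple

Segment = List[Tuple[float, float]]  # (score, PD) per application


def portfolio_ar_dr(segments: Dict[str, Segment], thresholds: Dict[str, float]):
    """Portfolio approval rate and average PD of the approved applications."""
    total = approved = 0
    pd_sum = 0.0
    for name, apps in segments.items():
        for score, pd in apps:
            total += 1
            if score >= thresholds[name]:
                approved += 1
                pd_sum += pd
    ar = approved / total
    dr = pd_sum / approved if approved else 0.0
    return ar, dr


def improve_segment(segments, thresholds, name, step=15, max_step=60):
    """Try tightening, then loosening, one segment's threshold around its base."""
    base_ar, base_dr = portfolio_ar_dr(segments, thresholds)
    best, best_ar = dict(thresholds), base_ar
    for shift in range(step, max_step + 1, step):
        for signed in (shift, -shift):  # tightening first, then loosening
            trial = dict(thresholds)
            trial[name] += signed
            ar, dr = portfolio_ar_dr(segments, trial)
            if ar > best_ar and dr <= base_dr:
                best, best_ar = trial, ar
    return best
```

In a portfolio where one segment contains low-PD applications just below its threshold, loosening that segment raises the overall approval rate without raising the average PD, which is the kind of move the paragraph describes.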

[134] Next, the scoring model (or models) and the cut-off zones are deployed to the production environment.

[135] The automatic selection of cut-off levels may fail to achieve the target approval level with the required accuracy. The scoring cut-offs must therefore be adjusted adaptively to bring the approval level as close as possible to the target AR. To this end, adaptive adjustment of the obtained cut-off scores begins 7 full days after the last change of the target approval level, or immediately after the cut-off scores have been adjusted without a change of the target approval level. It continues until the actual approval level has fallen within the acceptable bounds at least once. The adjustment follows this scheme. The following value is added to all cut-off zones:

where Δ = AR - high_AR if AR last left the established bounds on the upper side;

where Δ = low_AR - AR if it left them on the lower side. The size of the adjustment is set by experts, empirically, in a reference table.

[136] The adjustments are run daily until the approval level returns to the acceptable interval between low_AR and high_AR.
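The daily adjustment of [135] and [136] can be sketched as follows. The expression for the value added to the cut-offs is not reproduced in this text, so the sketch only computes Δ as defined above and looks the adjustment size up in a reference table. The table contents, the function name and the sign convention (raise the cut-offs when AR is too high, lower them when it is too low) are assumptions.

```python
# Hypothetical expert-defined reference table: (upper bound on delta, points).
ADJUSTMENT_TABLE = [(0.01, 5), (0.03, 10), (float("inf"), 20)]


def cutoff_correction(ar, low_ar, high_ar, table=ADJUSTMENT_TABLE):
    """Points to add to every cut-off for one daily adjustment step."""
    if low_ar <= ar <= high_ar:
        return 0  # inside the acceptable interval: no adjustment
    above = ar > high_ar
    delta = ar - high_ar if above else low_ar - ar
    size = next(points for bound, points in table if delta <= bound)
    # Assumed sign: a higher cut-off lowers approval, a lower one raises it.
    return size if above else -size
```

Calling this once per day and stopping when it returns 0 mirrors the rule that adjustments run until the approval level is back between low_AR and high_AR.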

[137] In some embodiments, the scoring model is validated automatically every month in accordance with the methodology adopted by the financial and credit institution. If the model fails validation, it is sent for retraining.

[138] In some embodiments, the approval level is monitored daily. This methodology is suitable for monitoring not only the approval level but also other indicators, such as risk. For monitoring purposes, the moving average of the approval level with a 7-day window is treated as a time series whose elements are modeled by independent normally distributed random variables. To keep the approval level at a given target, a criterion is needed, first of all, for deciding that a change has really occurred, since this indicator fluctuates naturally. A CUSUM test can therefore be used to detect deviations from the target approval level. For this purpose, the moment of change in the approval level (the change point, or disorder) is defined as the moment at which the distribution law of the incoming approval-level data changes. This technical solution considers a change in the mean. Let X_n, n ≥ 1, be a sequence of observations modeled by independent normally distributed random variables; let θ ∈ [1, n] be the unknown moment at which the distribution of the observations changes from ƒ₀ ~ N(μ₀, σ²) to ƒ₁ ~ N(μ₁, σ²); and let n be the current moment in time. Since the exact change point is unknown, the hypothesis H₀ is that no change has occurred on the interval [1, n], and H₁ is that a change has occurred on [1, n]. To distinguish between these two hypotheses, a generalized likelihood ratio criterion must be defined:

[139] where C₀ controls the number of false alarms. This expression is known as the CUSUM test. In this form the test is computationally inefficient, but for independent random variables the statistic can be written as a recurrence relation:

[140] Since we assume that ƒ₀ and ƒ₁ are normal distributions:

[141] Let μ₁ = μ₀ ± δ, where δ is the permissible error, chosen according to the deviation considered acceptable. The expression for calculating CUSUM can then be rewritten as:

- for downward deviations, and

- for upward deviations.

[142] The final decision is found from the condition
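The displayed expressions for the test statistic and the decision rule are not reproduced in this text. As a minimal sketch, assuming the textbook CUSUM for a shift in the mean of independent N(μ₀, σ²) observations, the log-likelihood ratio of one observation for μ₁ = μ₀ ± δ is ±(δ/σ²)(x - μ₀ ∓ δ/2), and the recurrence keeps a running sum clipped at zero for each direction. The function name and the rule "alarm when either statistic reaches C₀" are assumptions.

```python
from typing import Optional, Sequence


def cusum_alarm(
    xs: Sequence[float], mu0: float, sigma: float, delta: float, c0: float
) -> Optional[int]:
    """Return the 1-based index of the first alarm, or None if there is none."""
    k = delta / sigma ** 2
    s_up = s_down = 0.0
    for n, x in enumerate(xs, start=1):
        # Log-likelihood ratio increment for mu1 = mu0 + delta.
        s_up = max(0.0, s_up + k * (x - mu0 - delta / 2))
        # Log-likelihood ratio increment for mu1 = mu0 - delta.
        s_down = max(0.0, s_down - k * (x - mu0 + delta / 2))
        if max(s_up, s_down) >= c0:
            return n
    return None
```

For example, with μ₀ = 0.40, σ = 0.02, δ = 0.02 and C₀ = 6, five observations at 0.40 followed by a jump to 0.46 leave both statistics at zero until the jump, after which the upward statistic grows by about 2.5 per observation and crosses the threshold on the third shifted observation.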

[143] The described approach detects deviations of the approval level from the target level with minimal delay and a small number of false alarms.

[144] If the target approval level was changed less than 7 days ago, the CUSUM test cannot be run, because there are no observations of the 7-day moving average approval level that exclude days before the adjustment. In addition, it is necessary to guard against incorrect operation of the CUSUM test. For this purpose, an alternative, simpler test is used, based on setting the bounds of the acceptable range for the observed indicator.
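The monitored series and the simpler fallback test of [144] can be sketched as follows. The helper names and the string labels returned by `choose_test` are hypothetical; the 7-day window and the rule that CUSUM is unavailable for fewer than 7 days after a target change come from the text.

```python
from typing import List, Sequence


def rolling_mean(values: Sequence[float], window: int = 7) -> List[float]:
    """Moving average over full windows only (no partial windows)."""
    return [
        sum(values[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(values))
    ]


def out_of_range(ar: float, low_ar: float, high_ar: float) -> bool:
    """Fallback test: is the observed indicator outside the acceptable range?"""
    return not (low_ar <= ar <= high_ar)


def choose_test(days_since_target_change: int, window: int = 7) -> str:
    # CUSUM needs a moving average built only from days after the change.
    return "cusum" if days_since_target_change >= window else "range"
```

Because the range test needs no history, it can run from the first day after a change and can also serve as a cross-check on the CUSUM result once both are available.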

[145] Aspects of the present invention can also be implemented by a data processing device that is a computer or a system (or by means such as a central/graphics processor or a microprocessor) that reads and executes a program recorded on a storage device in order to perform the functions of the embodiment(s) described above, and by the method shown in FIG. 1, the steps of which are performed by a computer or device, for example by reading and executing a program recorded on a storage device in order to perform the functions of the embodiment(s) described above. For this purpose, the program is provided to the computer, for example, over a network or from a recording medium of any of various types serving as the storage device (for example, a computer-readable medium).

[146] The data processing device may have additional features or functionality. For example, the data processing device may also include additional data storage devices (removable and non-removable), such as magnetic disks, optical disks or tape. Data storage devices may include volatile and non-volatile, removable and non-removable media implemented by any method or technology for storing information, such as computer-readable instructions, data structures, program modules or other data. A data storage device, removable storage and non-removable storage are examples of computer storage media. Computer storage media include, but are not limited to, random access memory (RAM), read-only memory (ROM), electrically erasable programmable ROM (EEPROM), flash memory or memory based on another technology, compact disc ROM (CD-ROM), digital versatile discs (DVD) or other optical storage devices, magnetic cassettes, magnetic tapes, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to store the desired information and that can be accessed by the data processing device. The data processing device may also include input device(s), such as a keyboard, mouse, pen, voice input device, touch input device, and so on. Output device(s), such as a display, speakers, printer and the like, may also be included in the system.

[147] The data processing device contains communication connections that allow the device to communicate with other computing devices, for example over a network. Networks include local area networks and wide area networks along with other large, scalable networks, including, but not limited to, corporate networks and extranets. A communication connection is an example of a communication medium. As a rule, a communication medium can be embodied as computer-readable instructions, data structures, program modules or other data in a modulated data signal, such as a carrier wave, or in another transport mechanism, and includes any information delivery medium. The term "modulated data signal" means a signal that has one or more of its characteristics changed or set in such a way as to encode information in the signal. By way of example, and not limitation, communication media include wired media, such as a wired network or a direct wired connection, and wireless media, such as acoustic, radio-frequency, infrared and other wireless media. The term "computer-readable medium", as used in this document, includes both storage media and communication media. The sequences of processes described in this document can be executed using hardware, software, or a combination of the two. When the processes are executed in software, the program in which the sequence of processes is recorded can be installed and executed in the memory of a computer built into dedicated hardware, or the program can be installed and executed on a general-purpose computer capable of performing various processes.

[148] For example, the program may be recorded in advance on a recording medium such as a hard disk or ROM (read-only memory). Alternatively, the program may be stored (recorded) temporarily or permanently on a removable recording medium such as a floppy disk, CD-ROM (compact disc read-only memory), MO (magneto-optical) disk, DVD (digital versatile disc), magnetic disk or semiconductor memory. The removable recording medium may be distributed as so-called packaged software sold through retail channels.

[149] The program may be installed on a computer from the removable recording medium described above, or may be transferred to the computer by cable from a download site, or may be transferred to the computer over network data channels such as a LAN (local area network) or the Internet. The computer can receive the program transferred in this way and install it on a recording medium such as a built-in hard disk.

[150] The processes described in this document may be executed sequentially in time, as described, or may be executed in parallel or separately, depending on the processing capabilities of the device executing the processes or as required. The system described in this document is a logical set of multiple devices and is not limited to a structure in which these devices are installed in a single housing.

Claims

1. A computerized method for developing and controlling scoring models, comprising the following steps:

- obtaining data for a given period of time containing factors that affect the scoring model;

- dividing the obtained data into samples for development, validation and testing of the scoring model;

- transforming the factors by establishing relationships between groups of values of the transformed factor and default levels;

- excluding from the samples at least one transformed factor that correlates with at least one other factor;

- forming a credit scoring model by training a binary multiple logistic regression;

- automatically selecting cut-off zones for at least one scoring model for its installation into the credit procedure.

2. The method according to claim 1, characterized in that the data for the specified period of time are obtained from the user's mobile communication device.

3. The method according to claim 1, characterized in that, when the obtained data are partitioned into samples, parts of the original population that do not overlap in time, or random subsamples, are obtained.

4. The method according to claim 1, characterized in that the factors affecting the scoring model are annual income, and/or the amount of outstanding debt, and/or ownership of real estate, and/or ownership of a car, and/or length of employment at the most recent workplace, and/or age.

5. The method according to claim 1, characterized in that the factors affecting the scoring model are discrete or continuous.

6. The method according to claim 1, characterized in that, during the transformation of the factors, the degree of deviation of the default level in a data group from the average default level across the entire sample is determined.

7. The method according to claim 1, characterized in that, during the transformation of the factors, the algorithm for splitting factor values is launched with a new set of settings for the factors included in the exclusion list.

8. The method according to claim 1, characterized in that, when transformed factors are excluded from the samples, a table of the pairwise correlation coefficients of the transformed factors is formed.

9. The method according to claim 1, characterized in that, when transformed factors are excluded from the samples, the factor having the largest number of factors correlated with it is selected in a loop.

10. The method according to claim 1, characterized in that, when the credit scoring model is formed, a logistic model is built using stepwise regression to select the final set of factors.
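
For illustration only, the exclusion of correlated factors recited in claims 8 and 9 might be sketched as follows. This is a minimal sketch, not the claimed implementation: the use of the Pearson coefficient, the `threshold` setting, the alphabetical tie-breaking, the function names and the toy factor values are all assumptions that do not appear in the claims.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (assumed measure)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def correlation_table(factors):
    """Table of pairwise correlation coefficients of the factors (claim 8)."""
    return {
        (a, b): pearson(factors[a], factors[b])
        for a, b in combinations(sorted(factors), 2)
    }


def exclude_correlated(factors, threshold=0.7):
    """In a loop, drop the factor with the most correlated partners (claim 9).

    `threshold` is a hypothetical setting; ties are broken alphabetically.
    Returns (names of remaining factors, names of excluded factors).
    """
    remaining = dict(factors)
    excluded = []
    while remaining:
        counts = {name: 0 for name in remaining}
        for (a, b), r in correlation_table(remaining).items():
            if abs(r) > threshold:
                counts[a] += 1
                counts[b] += 1
        worst = max(sorted(counts), key=counts.get)
        if counts[worst] == 0:  # no correlated pairs are left
            break
        excluded.append(worst)
        del remaining[worst]
    return sorted(remaining), excluded


# Toy transformed-factor values (hypothetical): "income" correlates with
# both "debt" and "tenure", which do not correlate with each other.
factors = {
    "income": [2, 0, 0, -2],
    "debt": [1, -1, 1, -1],
    "tenure": [1, 1, -1, -1],
    "age": [1, -1, -1, 1],
}
remaining, excluded = exclude_correlated(factors, threshold=0.6)
print(remaining, excluded)
```

In this toy data, "income" has two correlated partners while "debt" and "tenure" have one each, so a single pass of the loop removes "income" and leaves no correlated pairs. Rebuilding the table on every pass keeps the counts consistent with the factors that are still in the sample.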