WO2018120726A1 - Data mining-based modeling method, system, electronic device, and storage medium - Google Patents

Info

Publication number
WO2018120726A1
Authority
WO
WIPO (PCT)
Prior art keywords
indicator
group
candidate
model
distance
Prior art date
Application number
PCT/CN2017/091374
Other languages
English (en)
French (fr)
Inventor
陈依云
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2018120726A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/211 Selection of the most significant subset of features
    • G06F18/2113 Selection of the most significant subset of features by ranking or filtering the set of features, e.g. using a measure of variance or of feature cross-correlation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures

Definitions

  • The present invention relates to the field of data mining technology, and in particular to a data mining-based modeling method, system, electronic device, and computer-readable storage medium.
  • In data-mining-related modeling, a large number of candidate modeling indicators is usually collected, sometimes more than 200, but typically only some of them are useful for modeling; for example, only 30 of 200 candidate modeling indicators may be effective.
  • To screen out the effective indicators needed for modeling from a large number of candidates, the existing approach is to manually select highly correlated indicators. Because manual selection is subjective, it cannot accurately select the effective indicators, and modeling efficiency is low.
  • The object of the present invention is to provide a data mining-based modeling method, system, electronic device, and computer-readable storage medium that accurately select the candidate indicators with the weakest correlation and improve modeling efficiency.
  • To achieve the above object, the present invention provides a data mining-based modeling method, which includes:
  • To achieve the above object, the present invention also provides a data mining-based modeling apparatus, which comprises:
  • an equal-division module configured to divide the candidate indicators equally into K indicator groups after receiving the candidate indicators to be screened;
  • a calculation module configured to calculate the intra-group distance D1 and the inter-group distance D2 of each candidate indicator in each indicator group, and to calculate, from D1 and D2 and based on a predetermined calculation rule, the screening evaluation value A corresponding to each candidate indicator;
  • an establishing module configured to select candidate indicators according to the screening evaluation value A, and to establish an indicator model based on the K value using the selected candidate indicators.
  • The present invention also provides an electronic device including a memory and a processor connected to the memory, the memory storing a data mining-based modeling system that can run on the processor; when executed by the processor, the data mining-based modeling system implements the following steps:
  • The present invention also provides a computer-readable storage medium on which a data mining-based modeling system is stored; when executed by a processor, the data mining-based modeling system implements the following steps:
  • The beneficial effects of the present invention are as follows: after dividing the candidate indicators equally into several indicator groups, the invention first calculates the intra-group distance D1 and the inter-group distance D2 of each candidate indicator in each indicator group, and then calculates the screening evaluation value A from D1 and D2. Because A takes both D1 and D2 into account, the candidate indicators with the weakest correlation can be selected according to A; that is, the selected candidate indicators are the most representative or most effective ones. No manual selection is needed, selection accuracy is high, and modeling efficiency is high.
  • FIG. 1 is a schematic flowchart of a first embodiment of the data mining-based modeling method according to the present invention;
  • FIG. 2 is a detailed flowchart of step S2 shown in FIG. 1;
  • FIG. 3 is a detailed flowchart of step S3 shown in FIG. 1;
  • FIG. 4 is a schematic flowchart of a second embodiment of the data mining-based modeling method according to the present invention;
  • FIG. 5 is a schematic diagram of the application environment of an embodiment of the data mining-based modeling method according to the present invention;
  • FIG. 6 is a schematic structural diagram of an embodiment of the data mining-based modeling system according to the present invention;
  • FIG. 7 is a schematic structural diagram of the calculation module shown in FIG. 6;
  • FIG. 8 is a schematic structural diagram of the establishing module shown in FIG. 6.
  • As shown in FIG. 1, FIG. 1 is a schematic flowchart of an embodiment of the data mining-based modeling method according to the present invention.
  • The data mining-based modeling method is applied to an electronic device and includes the following steps:
  • Step S1: after receiving the candidate indicators to be screened, divide the candidate indicators equally into K indicator groups;
  • After receiving the candidate indicators to be screened, this embodiment randomly divides them equally into K indicator groups in order to perform cluster analysis on them.
  • K is a natural number greater than 1. For example, with 150 candidate indicators and K = 10, the indicators are randomly divided into 10 indicator groups of 15 candidate indicators each.
  • Before the 150 candidate indicators are received, for example when there are 200 initial candidate indicators, the 150 candidate indicators can be preliminarily selected by forward-backward stepwise regression with suitably set parameters.
  • Taking a model of whether a customer will file a claim as an example, the candidate indicators include demographic characteristics, life-stage characteristics, customer value information, product holdings, insurance purchasing habits, historical claim information, and the like.
  • Step S2: calculate the intra-group distance D1 and the inter-group distance D2 of each candidate indicator in each indicator group, and calculate, from D1 and D2 and based on a predetermined calculation rule, the screening evaluation value A corresponding to each candidate indicator;
  • In this embodiment, the intra-group distance D1 is the correlation coefficient between a candidate indicator variable and the group center set. The larger D1 is, the stronger the correlation between the candidate indicator and the group center set.
  • The group center set is determined by the mean of the candidate indicators in each indicator group.
  • The inter-group distance D2 is the correlation coefficient between a candidate indicator variable and the center of the nearest other group. The smaller D2 is, the greater the correlation between the candidate indicator and the center of the nearest other group.
  • When the screening evaluation value A is calculated from D1 and D2 of each candidate indicator, both distances are taken into account, so the resulting value A is both comprehensive and purposeful.
  • Step S3: select candidate indicators according to the screening evaluation value A, and establish an indicator model based on the K value using the selected candidate indicators.
  • When candidate indicators are selected according to the screening evaluation value A in this embodiment, the candidate indicators with the weakest correlation can be selected: for example, the candidate indicator with the largest A and the candidate indicator with the smallest A, or the 10 candidate indicators with the largest A and the 10 candidate indicators with the smallest A.
  • In addition, the established model may be, for example, a logistic regression model, a decision tree model, or a neural network model.
  • The model is established according to the number K of indicator groups. For example, one or several kinds of model may be established when K is small, and another kind or several other kinds when K exceeds a certain threshold; that is, the model to establish is determined mainly by the number of indicator groups.
  • Compared with the prior art, this embodiment divides the candidate indicators equally into several indicator groups, first calculates the intra-group distance D1 and the inter-group distance D2 of each candidate indicator in each indicator group, and then calculates the screening evaluation value A from D1 and D2. Because A takes both D1 and D2 into account, the candidate indicators with the weakest correlation can be selected according to A; that is, the selected candidate indicators are the most representative or most effective ones. No manual selection is needed, selection accuracy is high, and modeling efficiency is high.
  • Step S2 includes:
  • The group center of the five candidate indicator variables is the component-wise mean of the five variables:
  • M = (0.043906, 0.125792, -1.22513, 0.110018, -1.12762, -0.85941, -1.13551, 0.871851, -0.8972, -0.45401).
  • From the above, the distance between the candidate indicator variable X1 and the group center is obtained from the mean of X1, the mean of the group center set M, and n, the number of samples (the number of indicator groups); the formula itself appears only as an image in the original publication.
  • The mean of X1 is calculated as -0.73831, and the mean of M is -0.45473.
  • This distance is the intra-group distance D1 of the candidate indicator variable X1.
  • In the same way, the intra-group distance D1 of each candidate indicator variable can be calculated.
  • m_pi are the components of the center M_P of each indicator group;
  • m_qi are the components of the center M_Q of the other indicator groups.
  • Step S3 above includes:
  • S31: in each indicator group, select at least one candidate indicator corresponding to the maximum screening evaluation value and at least one candidate indicator corresponding to the minimum screening evaluation value;
  • S32: if the K value is greater than or equal to a preset threshold, establish a predetermined indicator model using the candidate indicators selected from each indicator group;
  • S33: if the K value is less than the preset threshold, increase the K value, recalculate the screening evaluation values, and perform step S31 again, so as to establish another predetermined indicator model using the candidate indicators selected from each indicator group.
  • In each indicator group, at least one candidate indicator with the largest screening evaluation value and at least one candidate indicator with the smallest screening evaluation value may be selected, so that the correlation among the selected candidate indicators is the weakest. When that correlation is the weakest, the selected candidate indicators are the most representative or most effective indicators.
  • If K is greater than or equal to the preset threshold (for example, a preset threshold of 15), the candidate indicators selected from each indicator group are used to establish a predetermined indicator model. If K is less than the preset threshold, K is increased by 1, the candidate indicators are re-divided equally into (K+1) indicator groups, the corresponding intra-group distance D1, inter-group distance D2, and screening evaluation value A are recalculated, and candidate indicators are selected according to the screening evaluation value A.
  • The method further includes:
  • S4: verify the established indicator models using predetermined verification data samples, and apply the indicator model with the highest verified accuracy as the benchmark model.
  • After the models have been established, they can be verified using predetermined verification data samples to determine the accuracy of each model, and the model with the highest accuracy is then used as the benchmark model.
  • If exactly one indicator model has the highest accuracy, that indicator model is applied as the benchmark model;
  • If more than one indicator model has the highest accuracy, one of them is selected at random as the benchmark model, or the number of verification data samples is increased until exactly one indicator model has the highest accuracy, and that model is applied as the benchmark model.
  • FIG. 5 is a schematic diagram of the application environment of a preferred embodiment of the data mining-based modeling method of the present invention.
  • The application environment includes an electronic device 1 and a terminal device 2.
  • The electronic device 1 can exchange data with the terminal device 2 through a suitable technology such as a network or near-field communication.
  • The terminal device 2 includes, but is not limited to, any electronic product that can interact with a user through a keyboard, a mouse, a remote controller, a touch panel, or a voice control device: for example, mobile devices such as a personal computer, a tablet computer, a smartphone, a personal digital assistant (PDA), a game console, an Internet Protocol Television (IPTV), a smart wearable device, or a navigation device, and fixed terminals such as a digital TV, a desktop computer, a notebook computer, or a server.
  • The electronic device 1 is a device capable of automatically performing numerical calculation and/or information processing according to instructions that are set or stored in advance.
  • The electronic device 1 may be a computer, a single network server, a server group composed of multiple network servers, or a cloud composed of a large number of hosts or network servers based on cloud computing, where cloud computing is a type of distributed computing: a super virtual computer consisting of a group of loosely coupled computers.
  • The electronic device 1 may include, but is not limited to, a memory 11, a processor 12, and a network interface 13 communicably connected to each other through a system bus, and the memory 11 stores a data mining-based modeling system that can run on the processor 12. Note that FIG. 5 only shows the electronic device 1 with the components 11-13; not all illustrated components are required, and more or fewer components may be implemented instead.
  • The memory 11 includes an internal memory and at least one type of readable storage medium.
  • The internal memory provides a cache for the operation of the electronic device 1;
  • The readable storage medium may be a non-volatile storage medium such as a flash memory, a hard disk, a multimedia card, a card-type memory (e.g., SD or DX memory), a random access memory (RAM), a static random access memory (SRAM), a read-only memory (ROM), an electrically erasable programmable read-only memory (EEPROM), a programmable read-only memory (PROM), a magnetic memory, a magnetic disk, an optical disk, or the like.
  • In some embodiments, the readable storage medium may be an internal storage unit of the electronic device 1, such as a hard disk of the electronic device 1; in other embodiments, the non-volatile storage medium may also be an external storage device of the electronic device 1, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card equipped on the electronic device 1.
  • In this embodiment, the readable storage medium of the memory 11 is generally used to store the operating system and the various application software installed on the electronic device 1, such as the program code of the data mining-based modeling system in an embodiment of the present invention. The memory 11 can also be used to temporarily store various types of data that have been output or are to be output.
  • In some embodiments, the processor 12 may be a central processing unit (CPU), a controller, a microcontroller, a microprocessor, or another data processing chip.
  • The processor 12 is typically used to control the overall operation of the electronic device 1, for example to perform control and processing related to data interaction or communication with the terminal device 2.
  • In this embodiment, the processor 12 is configured to run program code or process data stored in the memory 11, for example to run the data mining-based modeling system.
  • The network interface 13 may comprise a wireless network interface or a wired network interface, and is typically used to establish a communication connection between the electronic device 1 and other electronic devices.
  • In this embodiment, the network interface 13 is mainly used to connect the electronic device 1 with one or more terminal devices 2, and to establish a data transmission channel and a communication connection between the electronic device 1 and the one or more terminal devices 2.
  • The data mining-based modeling system is stored in the memory 11 and includes at least one computer-readable instruction stored in the memory 11; the at least one computer-readable instruction can be executed by the processor 12 to implement the data mining-based modeling methods of the embodiments of the present invention. As described later, the at least one computer-readable instruction can be divided into different logic modules according to the functions implemented by its parts.
  • The candidate indicators are divided equally into K indicator groups;
  • The intra-group distance D1 and the inter-group distance D2 of each candidate indicator in each indicator group are calculated, and the screening evaluation value A corresponding to each candidate indicator is calculated from D1 and D2 based on a predetermined calculation rule;
  • Candidate indicators are selected according to the screening evaluation value A, and an indicator model is established based on the K value using the selected candidate indicators, so that the candidate indicators with the weakest correlation can be accurately selected and modeling efficiency improved.
  • FIG. 6 is a schematic structural diagram of an embodiment of the data mining-based modeling system. The data mining-based modeling system runs in an electronic device and can be divided into multiple functional modules according to its different functions.
  • The data mining-based modeling system includes:
  • The equal-division module 101 is configured to divide the candidate indicators equally into K indicator groups after receiving the candidate indicators to be screened;
  • After receiving the candidate indicators to be screened, this embodiment randomly divides them equally into K indicator groups in order to perform cluster analysis on them.
  • K is a natural number greater than 1. For example, with 150 candidate indicators and K = 10, the indicators are randomly divided into 10 indicator groups of 15 candidate indicators each.
  • Taking a model of whether a customer will file a claim as an example, the candidate indicators include demographic characteristics, life-stage characteristics, customer value information, product holdings, insurance purchasing habits, historical claim information, and the like.
  • The calculation module 102 is configured to calculate the intra-group distance D1 and the inter-group distance D2 of each candidate indicator in each indicator group, and to calculate, from D1 and D2 and based on a predetermined calculation rule, the screening evaluation value A corresponding to each candidate indicator.
  • In this embodiment, the intra-group distance D1 is the correlation coefficient between a candidate indicator variable and the group center set. The larger D1 is, the stronger the correlation between the candidate indicator and the group center set.
  • The group center set is determined by the mean of the candidate indicators in each indicator group.
  • The inter-group distance D2 is the correlation coefficient between a candidate indicator variable and the center of the nearest other group. The smaller D2 is, the greater the correlation between the candidate indicator and the center of the nearest other group.
  • When the screening evaluation value A is calculated from D1 and D2 of each candidate indicator, both distances are taken into account, so the resulting value A is both comprehensive and purposeful.
  • The establishing module 103 is configured to select candidate indicators according to the screening evaluation value A, and to establish an indicator model based on the K value using the selected candidate indicators.
  • When candidate indicators are selected according to the screening evaluation value A in this embodiment, the candidate indicators with the weakest correlation can be selected: for example, the candidate indicator with the largest A and the candidate indicator with the smallest A, or the 10 candidate indicators with the largest A and the 10 candidate indicators with the smallest A.
  • In addition, the established model may be, for example, a logistic regression model, a decision tree model, or a neural network model.
  • The model is established according to the number K of indicator groups. For example, one or several kinds of model may be established when K is small, and another kind or several other kinds when K exceeds a certain threshold; that is, the model to establish is determined mainly by the number of indicator groups.
  • The calculation module 102 includes:
  • a first calculating unit 1021, configured to calculate the mean of the candidate indicators in each indicator group, obtain the group center set from the mean, calculate the distance between each candidate indicator and the group center set, and take the calculated distance as the intra-group distance D1;
  • a second calculating unit 1022, configured to calculate the center distance between the indicator group in which each candidate indicator is located and each of the other indicator groups, obtain the indicator group with the smallest center distance, and calculate the inter-group distance D2 according to the obtained indicator group.
  • The group center of the five candidate indicator variables is the component-wise mean of the five variables:
  • M = (0.043906, 0.125792, -1.22513, 0.110018, -1.12762, -0.85941, -1.13551, 0.871851, -0.8972, -0.45401).
  • From the above, the distance between the candidate indicator variable X1 and the group center is obtained from the mean of X1, the mean of the group center set M, and n, the number of samples (the number of indicator groups); the formula itself appears only as an image in the original publication.
  • The mean of X1 is calculated as -0.73831, and the mean of M is -0.45473.
  • This distance is the intra-group distance D1 of the candidate indicator variable X1.
  • In the same way, the intra-group distance D1 of each candidate indicator variable can be calculated.
  • m_pi are the components of the center M_P of each indicator group;
  • m_qi are the components of the center M_Q of the other indicator groups.
  • The establishing module 103 includes:
  • a selecting unit 1031, configured to select, in each indicator group, at least one candidate indicator corresponding to the maximum screening evaluation value and at least one candidate indicator corresponding to the minimum screening evaluation value;
  • a first establishing unit 1032, configured to establish a predetermined indicator model using the candidate indicators selected from each indicator group if the K value is greater than or equal to a preset threshold;
  • a second establishing unit 1033, configured to, if the K value is less than the preset threshold, increase the K value, recalculate the screening evaluation values, and have the selecting unit 1031 select again, so as to establish another predetermined indicator model using the candidate indicators selected from each indicator group.
  • In each indicator group, at least one candidate indicator with the largest screening evaluation value and at least one candidate indicator with the smallest screening evaluation value may be selected, so that the correlation among the selected candidate indicators is the weakest. When that correlation is the weakest, the selected candidate indicators are the most representative or most effective indicators.
  • If K is greater than or equal to the preset threshold (for example, a preset threshold of 15), the candidate indicators selected from each indicator group are used to establish a predetermined indicator model. If K is less than the preset threshold, K is increased by 1, the candidate indicators are re-divided equally into (K+1) indicator groups, the corresponding intra-group distance D1, inter-group distance D2, and screening evaluation value A are recalculated, and candidate indicators are selected according to the screening evaluation value A.
  • The data mining-based modeling system further includes a verification module, configured to verify the established indicator models using predetermined verification data samples and to apply the indicator model with the highest verified accuracy as the benchmark model.
  • After the models have been established, they can be verified using predetermined verification data samples to determine the accuracy of each model, and the model with the highest accuracy is then used as the benchmark model.
  • The verification module is specifically configured to apply the indicator model with the highest accuracy as the benchmark model if exactly one indicator model has the highest accuracy; if more than one indicator model has the highest accuracy, it selects one of them at random as the benchmark model, or increases the number of verification data samples until exactly one indicator model has the highest accuracy and applies that model as the benchmark model.
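The benchmark-selection rule of the verification step (step S4) can be illustrated with a minimal Python sketch. The function name `pick_benchmark`, the dictionary input, and the choice of the random tie-break (rather than adding verification samples) are assumptions made for illustration; the source does not prescribe an implementation.

```python
import random

def pick_benchmark(accuracies, rng=None):
    """Return the name of the model to use as the benchmark model.

    accuracies: dict mapping a (hypothetical) model name to its accuracy
    on the verification data samples.
    """
    if not accuracies:
        raise ValueError("at least one verified model is required")
    best = max(accuracies.values())
    top = [name for name, acc in accuracies.items() if acc == best]
    if len(top) == 1:
        # Exactly one model has the highest accuracy: use it directly.
        return top[0]
    # Several models tie for the highest accuracy: pick one at random.
    # (The alternative named in the text is to add verification samples
    # until the tie disappears; that is not shown here.)
    return (rng or random).choice(top)

winner = pick_benchmark({"logistic": 0.81, "tree": 0.86, "mlp": 0.84})
```

In practice the accuracies would come from evaluating each established indicator model on the predetermined verification data samples.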

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

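The present invention relates to a data mining-based modeling method, system, electronic device, and computer-readable storage medium. The data mining-based modeling method includes: after receiving candidate indicators to be screened, dividing the candidate indicators equally into K indicator groups; calculating the intra-group distance D1 and the inter-group distance D2 of each candidate indicator in each indicator group, and calculating, from D1 and D2 and based on a predetermined calculation rule, the screening evaluation value A corresponding to each candidate indicator; and selecting candidate indicators according to the screening evaluation value A, and establishing an indicator model based on the K value using the selected candidate indicators. The present invention can accurately select the candidate indicators with the weakest correlation and improve modeling efficiency.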

Description

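Data mining-based modeling method, system, electronic device, and storage medium
Priority claim
Under the Paris Convention, this application claims priority to Chinese patent application No. CN201611263812.0, filed on December 30, 2016 and entitled "Data mining-based modeling method and apparatus", the entire content of which is incorporated herein by reference.
Technical field
The present invention relates to the field of data mining technology, and in particular to a data mining-based modeling method, system, electronic device, and computer-readable storage medium.
Background
At present, in data-mining-related modeling, a large number of candidate modeling indicators is usually collected, sometimes more than 200, but typically only some of them are useful for modeling; for example, only 30 of 200 candidate modeling indicators may be effective. To screen out the effective indicators needed for modeling from a large number of candidates, the existing approach is to manually select highly correlated indicators. Because manual selection is subjective, it cannot accurately select the effective indicators, and modeling efficiency is low.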
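Summary of the invention
The object of the present invention is to provide a data mining-based modeling method, system, electronic device, and computer-readable storage medium that accurately select the candidate indicators with the weakest correlation and improve modeling efficiency.
To achieve the above object, the present invention provides a data mining-based modeling method, which includes:
S1: after receiving the candidate indicators to be screened, dividing the candidate indicators equally into K indicator groups;
S2: calculating the intra-group distance D1 and the inter-group distance D2 of each candidate indicator in each indicator group, and calculating, from D1 and D2 and based on a predetermined calculation rule, the screening evaluation value A corresponding to each candidate indicator;
S3: selecting candidate indicators according to the screening evaluation value A, and establishing an indicator model based on the K value using the selected candidate indicators.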
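To achieve the above object, the present invention also provides a data mining-based modeling apparatus, which includes:
an equal-division module, configured to divide the candidate indicators equally into K indicator groups after receiving the candidate indicators to be screened;
a calculation module, configured to calculate the intra-group distance D1 and the inter-group distance D2 of each candidate indicator in each indicator group, and to calculate, from D1 and D2 and based on a predetermined calculation rule, the screening evaluation value A corresponding to each candidate indicator;
an establishing module, configured to select candidate indicators according to the screening evaluation value A, and to establish an indicator model based on the K value using the selected candidate indicators.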
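To achieve the above object, the present invention also provides an electronic device including a memory and a processor connected to the memory, the memory storing a data mining-based modeling system that can run on the processor; when executed by the processor, the data mining-based modeling system implements the following steps:
S1: after receiving the candidate indicators to be screened, dividing the candidate indicators equally into K indicator groups;
S2: calculating the intra-group distance D1 and the inter-group distance D2 of each candidate indicator in each indicator group, and calculating, from D1 and D2 and based on a predetermined calculation rule, the screening evaluation value A corresponding to each candidate indicator;
S3: selecting candidate indicators according to the screening evaluation value A, and establishing an indicator model based on the K value using the selected candidate indicators.
To achieve the above object, the present invention also provides a computer-readable storage medium on which a data mining-based modeling system is stored; when executed by a processor, the data mining-based modeling system implements the following steps:
S1: after receiving the candidate indicators to be screened, dividing the candidate indicators equally into K indicator groups;
S2: calculating the intra-group distance D1 and the inter-group distance D2 of each candidate indicator in each indicator group, and calculating, from D1 and D2 and based on a predetermined calculation rule, the screening evaluation value A corresponding to each candidate indicator;
S3: selecting candidate indicators according to the screening evaluation value A, and establishing an indicator model based on the K value using the selected candidate indicators.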
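The beneficial effects of the present invention are as follows: after dividing the candidate indicators equally into several indicator groups, the invention first calculates the intra-group distance D1 and the inter-group distance D2 of each candidate indicator in each indicator group, and then calculates the screening evaluation value A from D1 and D2. Because A takes both D1 and D2 into account, the candidate indicators with the weakest correlation can be selected according to A; that is, the selected candidate indicators are the most representative or most effective ones. No manual selection is needed, selection accuracy is high, and modeling efficiency is high.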
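Brief description of the drawings
FIG. 1 is a schematic flowchart of a first embodiment of the data mining-based modeling method of the present invention;
FIG. 2 is a detailed flowchart of step S2 shown in FIG. 1;
FIG. 3 is a detailed flowchart of step S3 shown in FIG. 1;
FIG. 4 is a schematic flowchart of a second embodiment of the data mining-based modeling method of the present invention;
FIG. 5 is a schematic diagram of the application environment of an embodiment of the data mining-based modeling method of the present invention;
FIG. 6 is a schematic structural diagram of an embodiment of the data mining-based modeling system of the present invention;
FIG. 7 is a schematic structural diagram of the calculation module shown in FIG. 6;
FIG. 8 is a schematic structural diagram of the establishing module shown in FIG. 6.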
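Detailed description
The principles and features of the present invention are described below with reference to the accompanying drawings. The examples given only serve to explain the invention and are not intended to limit its scope.
As shown in FIG. 1, FIG. 1 is a schematic flowchart of an embodiment of the data mining-based modeling method of the present invention. The method is applied to an electronic device and includes the following steps:
Step S1: after receiving the candidate indicators to be screened, divide the candidate indicators equally into K indicator groups;
After receiving the candidate indicators to be screened, this embodiment randomly divides them equally into K indicator groups in order to perform cluster analysis on them. K is a natural number greater than 1. For example, with 150 candidate indicators and K = 10, the indicators are randomly divided into 10 indicator groups of 15 candidate indicators each.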
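The random equal division of step S1 can be sketched in Python as follows. The function name `split_into_groups`, the `seed` parameter, and the round-robin dealing are illustrative assumptions, not details taken from the source; when the number of indicators is not a multiple of K, this sketch lets group sizes differ by at most one.

```python
import random

def split_into_groups(indicators, k, seed=None):
    """Randomly divide `indicators` into k groups of (near-)equal size."""
    if k <= 1:
        raise ValueError("K must be a natural number greater than 1")
    shuffled = list(indicators)
    random.Random(seed).shuffle(shuffled)
    # Deal the shuffled items round-robin so sizes differ by at most one.
    return [shuffled[i::k] for i in range(k)]

# 150 placeholder indicator names split into K = 10 groups of 15.
names = [f"indicator_{i}" for i in range(150)]
groups = split_into_groups(names, k=10, seed=0)
```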
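Before the 150 candidate indicators are received, for example when there are 200 initial candidate indicators, the 150 candidate indicators can be preliminarily selected by forward-backward stepwise regression with suitably set parameters.
Taking a model of whether a customer will file a claim as an example, the candidate indicators include demographic characteristics, life-stage characteristics, customer value information, product holdings, insurance purchasing habits, historical claim information, and the like.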
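Step S2: calculating an intra-group distance D1 and an inter-group distance D2 for each candidate indicator in each indicator group, and calculating a screening evaluation value A for each candidate indicator from the intra-group distance D1 and the inter-group distance D2 according to a predetermined calculation rule;
In this embodiment, the intra-group distance D1 is the correlation coefficient between a candidate indicator variable and the group-center set; the larger D1 is, the more strongly the candidate indicator is correlated with the group-center set. The group-center set is determined by the means of the candidate indicators in each indicator group.
The inter-group distance D2 is the correlation coefficient between a candidate indicator variable and the center of the nearest other group; the smaller D2 is, the more strongly the candidate indicator is correlated with the center of the nearest other group.
When the screening evaluation value A is calculated from the intra-group distance D1 and inter-group distance D2 of each candidate indicator, both distances are taken into account, so the resulting screening evaluation value A is both comprehensive and purposeful.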
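Step S3: selecting candidate indicators according to the screening evaluation value A, and building an indicator model based on the value of K using the selected candidate indicators.
When candidate indicators are selected according to the screening evaluation value A in this embodiment, the candidate indicators with the weakest correlation can be selected, for example the candidate indicator with the largest screening evaluation value A and the candidate indicator with the smallest screening evaluation value A, or the 10 candidate indicators with the largest values of A and the 10 with the smallest.
In addition, the model built may be, for example, a logistic regression model, a decision tree model or a neural network model. The model is built according to the number K of indicator groups: for example, one or several kinds of model may be built when K is small, and another kind or several other kinds when K exceeds a certain threshold; that is, the model to be built is determined mainly by the number of indicator groups.
Compared with the prior art, this embodiment divides the candidate indicators equally into several indicator groups, first calculates the intra-group distance D1 and inter-group distance D2 of each candidate indicator in each indicator group, and then calculates the screening evaluation value A from D1 and D2. Because the screening evaluation value A takes both the intra-group distance D1 and the inter-group distance D2 of a candidate indicator into account, the candidate indicators with the weakest correlation can be selected according to A; that is, the selected candidate indicators are the most representative or most effective indicators. No manual selection is needed, selection accuracy is high, and modeling efficiency is high.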
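In a preferred embodiment, as shown in FIG. 2 and on the basis of the embodiment of FIG. 1, step S2 includes:
S21: calculating the mean of the candidate indicators in each indicator group, obtaining a group-center set from the means, and calculating the distance between each candidate indicator and the group-center set, the calculated distance being taken as the intra-group distance D1;
S22: calculating the center distances between the indicator group containing each candidate indicator and every other indicator group, obtaining the indicator group with the smallest center distance, and calculating the inter-group distance D2 from the obtained indicator group;
S23: calculating the screening evaluation value A: A=(1-D1)/(1-D2).
In this embodiment, suppose there are five candidate indicator variables X1, X2, X3, X4 and X5, where Xi=(Xi1,Xi2,…,Xin) and n=10, as shown in Table 1 below: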
X1 X2 X3 X4 X5
-0.02106 -0.02075 -0.00183 -0.2542 0.517368
-0.02106 -0.02075 -0.00183 0.305505 0.367093
-1.54935 -1.54959 -1.49993 -1.00909 -0.51768
-0.02106 -0.02075 0.316522 0.305505 -0.03013
-1.54935 -1.54959 -1.49993 -1.00909 -0.03013
-1.54935 -1.54959 -1.49993 -0.2542 0.556034
-1.54935 -1.54959 -1.49993 -0.2542 -0.8245
0.936479 0.937007 0.909081 1.020655 0.556034
-1.54935 -1.54959 -1.49993 -0.2542 0.367093
-0.50968 -0.50945 -0.47902 -0.2542 -0.51768
Table 1
The group center of these five candidate indicator variables is the mean of each component across the five variables:
M=(m1,m2,…,mn), where
mj=(X1j+X2j+X3j+X4j+X5j)/5, j=1,2,…,n.
Here m1=(-0.02106-0.02075-0.00183-0.2542+0.517368)/5=0.043906 and m2=(-0.02106-0.02075-0.00183+0.305505+0.367093)/5=0.125792; the center of the five candidate indicator variables (i.e. the group-center set) is therefore:
M=(0.043906,0.125792,-1.22513,0.110018,-1.12762,-0.85941,-1.13551,0.871851,-0.8972,-0.45401).
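The group-center calculation can be reproduced from Table 1 with a few lines of code. The sketch below is illustrative only; the names `TABLE_1` and `group_center` are assumptions, and each tuple holds one row of Table 1, that is, the same component of X1 to X5.

```python
# Rows of Table 1: each tuple is (X1j, X2j, X3j, X4j, X5j) for one component j.
TABLE_1 = [
    (-0.02106, -0.02075, -0.00183, -0.2542, 0.517368),
    (-0.02106, -0.02075, -0.00183, 0.305505, 0.367093),
    (-1.54935, -1.54959, -1.49993, -1.00909, -0.51768),
    (-0.02106, -0.02075, 0.316522, 0.305505, -0.03013),
    (-1.54935, -1.54959, -1.49993, -1.00909, -0.03013),
    (-1.54935, -1.54959, -1.49993, -0.2542, 0.556034),
    (-1.54935, -1.54959, -1.49993, -0.2542, -0.8245),
    (0.936479, 0.937007, 0.909081, 1.020655, 0.556034),
    (-1.54935, -1.54959, -1.49993, -0.2542, 0.367093),
    (-0.50968, -0.50945, -0.47902, -0.2542, -0.51768),
]

def group_center(rows):
    """Component-wise mean across the variables in one indicator group."""
    return [sum(row) / len(row) for row in rows]

M = group_center(TABLE_1)
print([round(m, 6) for m in M])
```

Rounded to the precision used in the text, the result agrees with the vector M given above.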
From the above, the distance between candidate indicator variable X1 and the group center is:
Figure PCTCN2017091374-appb-000002
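where the symbol shown in Figure PCTCN2017091374-appb-000003 is the mean of candidate indicator variable X1, the symbol shown in Figure PCTCN2017091374-appb-000004 is the mean of the group-center set M, and n is the number of samples (the number of indicator groups). The mean of X1 is calculated to be -0.73831 and the mean of M to be -0.45473. This distance is the intra-group distance D1 of candidate indicator variable X1. The intra-group distance D1 of every other candidate indicator variable can be calculated in the same way.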
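Since the text defines D1 as a correlation coefficient built from the two means and n, one plausible reading is the Pearson correlation coefficient between X1 and M. The sketch below makes that assumption explicit; the exact formula in the original figure is not recoverable here, and the names `mean` and `pearson` are illustrative.

```python
import math

# X1 column of Table 1 and the group-center set M as given in the text
X1 = [-0.02106, -0.02106, -1.54935, -0.02106, -1.54935,
      -1.54935, -1.54935, 0.936479, -1.54935, -0.50968]
M = [0.043906, 0.125792, -1.22513, 0.110018, -1.12762,
     -0.85941, -1.13551, 0.871851, -0.8972, -0.45401]

def mean(values):
    return sum(values) / len(values)

def pearson(x, y):
    """Pearson correlation coefficient (assumed form of the 'distance')."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return cov / (sx * sy)

d1 = pearson(X1, M)  # intra-group distance D1 of X1 under this assumption
print(round(mean(X1), 5), round(mean(M), 5))
```

The two printed means correspond to the values -0.73831 and -0.45473 quoted in the text.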
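When calculating the inter-group distance, first calculate the distance between the center of the indicator group containing the candidate indicator variable and the centers of the other indicator groups: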
Figure PCTCN2017091374-appb-000005
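Here mpi are the components of the center MP of the given indicator group, and mqi are the components of the center MQ of another indicator group.
From the distances d above, find the center of the indicator group nearest to the candidate indicator variable's group, and then calculate the inter-group distance of the candidate indicator variable using the formula for the distance between a candidate indicator variable and a group center: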
Figure PCTCN2017091374-appb-000006
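Finding the nearest indicator group can be sketched as below. The center distance d is assumed here to be the Euclidean distance between group centers, because the original formula image is not available; `euclidean`, `nearest_group` and the toy `centers` data are all hypothetical.

```python
import math

def euclidean(p, q):
    """Assumed form of the center distance d between two group centers."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def nearest_group(own, centers):
    """Return the key of the other group whose center is closest to `own`."""
    others = (g for g in centers if g != own)
    return min(others, key=lambda g: euclidean(centers[own], centers[g]))

# Toy centers for three groups (made-up values, for illustration only)
centers = {"G1": [0.0, 0.0], "G2": [3.0, 4.0], "G3": [1.0, 1.0]}
closest = nearest_group("G1", centers)
```

The inter-group distance D2 of a candidate indicator would then be obtained by applying the indicator-to-center distance to the center of the group returned here.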
Finally, calculate the screening evaluation value A: A=(1-D1)/(1-D2). Alternatively, the screening evaluation value A can be calculated as A=(1-D2)/(1-D1).
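Both forms of the screening evaluation value can be expressed in one small function. This is a sketch; the name `screening_value`, the `inverse` flag and the explicit zero-denominator check are assumptions added for illustration, as the text does not discuss the case where the denominator is zero.

```python
def screening_value(d1, d2, inverse=False):
    """A = (1 - D1) / (1 - D2), or (1 - D2) / (1 - D1) when inverse=True."""
    num, den = (1 - d2, 1 - d1) if inverse else (1 - d1, 1 - d2)
    if den == 0:
        # Assumption: the source does not say how this case is handled.
        raise ZeroDivisionError("screening value undefined: denominator is 0")
    return num / den

a = screening_value(0.5, 0.75)                    # (1-0.5)/(1-0.75)
a_alt = screening_value(0.5, 0.75, inverse=True)  # (1-0.75)/(1-0.5)
```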
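In a preferred embodiment, as shown in FIG. 3 and on the basis of the embodiment of FIG. 1, step S3 includes:
S31: in each indicator group, selecting at least one candidate indicator with the largest screening evaluation value and at least one candidate indicator with the smallest screening evaluation value;
S32: if the value of K is greater than or equal to a preset threshold, building a predetermined indicator model using the candidate indicators selected from the indicator groups;
S33: if the value of K is less than the preset threshold, increasing the value of K, recalculating the screening evaluation values and performing step S31 again, so as to build another predetermined indicator model using the candidate indicators selected from the indicator groups.
In this embodiment, at least one candidate indicator with the largest screening evaluation value and at least one with the smallest can be selected for each indicator group, so that the correlation among the selected candidate indicators is weakest. When the correlation among the selected candidate indicators is weakest, the selected candidate indicators are the most representative or most effective indicators.
In this embodiment, if K is greater than or equal to the preset threshold (for example, a preset threshold of 15), a predetermined indicator model is built using the candidate indicators selected from the indicator groups; if K is less than the preset threshold, K is increased by 1, the candidate indicators are divided equally again into (K+1) indicator groups, the corresponding intra-group distance D1, inter-group distance D2 and screening evaluation value A are calculated, and candidate indicators are selected according to A to build another predetermined model.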
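Steps S31 to S33 can be sketched as a selection function plus a loop over K. This reflects one possible reading, in which a model is built for every K from the starting value up to the threshold; the text leaves this open. The callables `score_groups` and `build_model` are hypothetical stand-ins for the scoring and model-building steps.

```python
def select_extremes(scores_by_group, per_side=1):
    """S31: per group, keep the indicators with the smallest and largest A."""
    if per_side < 1:
        raise ValueError("per_side must be at least 1")
    selected = []
    for scores in scores_by_group:
        ranked = sorted(scores, key=scores.get)       # ascending by A
        picks = ranked[:per_side] + ranked[-per_side:]
        selected.extend(dict.fromkeys(picks))         # drop duplicates, keep order
    return selected

def build_models(k, threshold, score_groups, build_model):
    """S32/S33 under the stated assumption: raise K by 1 until the threshold."""
    models = []
    while True:
        selected = select_extremes(score_groups(k))
        models.append((k, build_model(k, selected)))
        if k >= threshold:
            return models
        k += 1  # re-divide into K+1 groups on the next pass

# Stub scoring and model building, for illustration only
demo_scores = lambda k: [{"a": 0.1, "b": 0.9, "c": 0.5} for _ in range(k)]
models = build_models(13, 15, demo_scores, lambda k, sel: f"model(K={k})")
```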
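In a preferred embodiment, as shown in FIG. 4 and on the basis of the embodiment of FIG. 1, the method further includes, after step S3:
S4: verifying the built indicator models using predetermined validation data samples, and applying the indicator model with the highest accuracy after verification as the baseline model.
In this embodiment, after the models are built, their accuracy can be verified. For example, predetermined validation data samples can be used to verify each built model and determine its accuracy, after which the model with the highest accuracy is applied as the baseline model.
Preferably, if exactly one indicator model has the highest accuracy, that indicator model is applied as the baseline model;
if more than one indicator model has the highest accuracy, one of them is selected at random as the baseline model, or the number of validation data samples is increased until exactly one indicator model has the highest accuracy, and that indicator model is applied as the baseline model.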
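The baseline selection of step S4, including the random tie-break, can be sketched as follows. The function name `pick_baseline`, the accuracy dictionary and the model names are hypothetical; the alternative of enlarging the validation set until the tie disappears is not shown.

```python
import random

def pick_baseline(accuracies, rng=None):
    """Return the model name with the highest accuracy; break ties at random."""
    best = max(accuracies.values())
    winners = [name for name, acc in accuracies.items() if acc == best]
    if len(winners) == 1:
        return winners[0]
    return (rng or random).choice(winners)

baseline = pick_baseline({"logistic": 0.80, "tree": 0.90, "nn": 0.85})
tied = pick_baseline({"logistic": 0.90, "tree": 0.90}, rng=random.Random(0))
```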
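FIG. 5 is a schematic diagram of the application environment of a preferred embodiment of the data mining-based modeling method of the present invention. The application environment includes an electronic device 1 and a terminal device 2. The electronic device 1 can exchange data with the terminal device 2 via a suitable technology such as a network or near-field communication.
The terminal device 2 includes, but is not limited to, any electronic product capable of human-computer interaction with a user via a keyboard, mouse, remote control, touchpad, voice-control device or the like, for example mobile devices such as a personal computer, tablet computer, smartphone, personal digital assistant (PDA), game console, Internet Protocol Television (IPTV), smart wearable device or navigation device, or fixed terminals such as a digital TV, desktop computer, notebook or server.
The electronic device 1 is a device capable of automatically performing numerical computation and/or information processing according to preset or stored instructions. The electronic device 1 may be a computer, a single network server, a server group composed of multiple network servers, or a cloud-computing-based cloud composed of a large number of hosts or network servers, where cloud computing is a form of distributed computing: a super virtual computer composed of a group of loosely coupled computers.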
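In this embodiment, the electronic device 1 may include, but is not limited to, a memory 11, a processor 12 and a network interface 13 communicatively connected to one another via a system bus; the memory 11 stores a data mining-based modeling system runnable on the processor 12. Note that FIG. 5 shows only the electronic device 1 with components 11-13; it should be understood that not all of the illustrated components are required, and more or fewer components may be implemented instead.
The memory 11 includes internal memory and at least one type of readable storage medium. The internal memory provides a cache for the operation of the electronic device 1; the readable storage medium may be a non-volatile storage medium such as flash memory, a hard disk, a multimedia card, card-type memory (for example SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, a magnetic disk or an optical disc. In some embodiments the readable storage medium may be an internal storage unit of the electronic device 1, such as its hard disk; in other embodiments the non-volatile storage medium may be an external storage device of the electronic device 1, such as a plug-in hard disk, Smart Media Card (SMC), Secure Digital (SD) card or flash card fitted to the electronic device 1. In this embodiment, the readable storage medium of the memory 11 is generally used to store the operating system and the application software installed on the electronic device 1, for example the program code of the data mining-based modeling system of an embodiment of the present invention. The memory 11 may also be used to temporarily store data that has been output or is to be output.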
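In some embodiments the processor 12 may be a central processing unit (CPU), controller, microcontroller, microprocessor or other data processing chip. The processor 12 is generally used to control the overall operation of the electronic device 1, for example performing control and processing related to data exchange or communication with the terminal device 2. In this embodiment, the processor 12 is used to run the program code stored in the memory 11 or to process data, for example to run the data mining-based modeling system.
The network interface 13 may include a wireless network interface or a wired network interface and is generally used to establish a communication connection between the electronic device 1 and other electronic equipment. In this embodiment, the network interface 13 is mainly used to connect the electronic device 1 to one or more terminal devices 2 and to establish a data transmission channel and communication connection between them.
The data mining-based modeling system is stored in the memory 11 and includes at least one computer-readable instruction stored in the memory 11, which can be executed by the processor 12 to implement the data mining-based modeling method of the embodiments of the present invention; as described below, the at least one computer-readable instruction can be divided into different logical modules according to the functions implemented by its parts.
The data mining-based modeling system, when executed by the processor 12, implements the following steps: after receiving the candidate indicators to be screened, dividing the candidate indicators equally into K indicator groups; calculating an intra-group distance D1 and an inter-group distance D2 for each candidate indicator in each indicator group, and calculating a screening evaluation value A for each candidate indicator from the intra-group distance D1 and the inter-group distance D2 according to a predetermined calculation rule; and selecting candidate indicators according to the screening evaluation value A and building an indicator model based on the value of K using the selected candidate indicators. In this way the candidate indicators with the weakest correlation can be selected accurately and modeling efficiency is improved.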
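FIG. 6 is a schematic structural diagram of an embodiment of the data mining-based modeling system of the present invention. The system runs in an electronic device and can be divided into multiple functional modules according to its functions. The data mining-based modeling system includes:
a dividing module 101, configured to divide the candidate indicators equally into K indicator groups after the candidate indicators to be screened are received;
In this embodiment, after the candidate indicators to be screened are received, they are randomly divided equally into K indicator groups so that cluster analysis can be performed on them. K is a natural number greater than 1; for example, with 150 candidate indicators and K = 10, the indicators are randomly divided into 10 indicator groups of 15 candidate indicators each.
Before the 150 candidate indicators are received, there may initially be, for example, 200 candidate indicators; forward and backward stepwise regression with suitable parameters can be used to preselect the 150 candidate indicators.
Taking a model of whether a customer will make a claim as an example, the candidate indicators include demographic characteristics, life-stage characteristics, customer value information, product holdings, insurance purchasing habits, historical claim information and so on.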
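a calculation module 102, configured to calculate an intra-group distance D1 and an inter-group distance D2 for each candidate indicator in each indicator group, and to calculate a screening evaluation value A for each candidate indicator from the intra-group distance D1 and the inter-group distance D2 according to a predetermined calculation rule;
In this embodiment, the intra-group distance D1 is the correlation coefficient between a candidate indicator variable and the group-center set; the larger D1 is, the more strongly the candidate indicator is correlated with the group-center set. The group-center set is determined by the means of the candidate indicators in each indicator group.
The inter-group distance D2 is the correlation coefficient between a candidate indicator variable and the center of the nearest other group; the smaller D2 is, the more strongly the candidate indicator is correlated with the center of the nearest other group.
When the screening evaluation value A is calculated from the intra-group distance D1 and inter-group distance D2 of each candidate indicator, both distances are taken into account, so the resulting screening evaluation value A is both comprehensive and purposeful.
an establishing module 103, configured to select candidate indicators according to the screening evaluation value A, and to build an indicator model based on the value of K using the selected candidate indicators.
When candidate indicators are selected according to the screening evaluation value A in this embodiment, the candidate indicators with the weakest correlation can be selected, for example the candidate indicator with the largest screening evaluation value A and the candidate indicator with the smallest screening evaluation value A, or the 10 candidate indicators with the largest values of A and the 10 with the smallest.
In addition, the model built may be, for example, a logistic regression model, a decision tree model or a neural network model. The model is built according to the number K of indicator groups: for example, one or several kinds of model may be built when K is small, and another kind or several other kinds when K exceeds a certain threshold; that is, the model to be built is determined mainly by the number of indicator groups.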
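The division of the system into modules 101, 102 and 103 can be pictured as three pluggable callables wired in sequence. The sketch below only shows the wiring; the class name `ModelingSystem` and the stub lambdas are hypothetical and do not implement the scoring or model building described in the text.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence

Groups = List[List[str]]
Scores = List[Dict[str, float]]

@dataclass
class ModelingSystem:
    divide: Callable[[Sequence[str], int], Groups]  # dividing module 101
    score: Callable[[Groups], Scores]               # calculation module 102
    build: Callable[[Scores, int], object]          # establishing module 103

    def run(self, indicators: Sequence[str], k: int) -> object:
        groups = self.divide(indicators, k)   # S1
        scores = self.score(groups)           # S2
        return self.build(scores, k)          # S3

# Stub modules that show the data flow only, not the patent's algorithms
system = ModelingSystem(
    divide=lambda names, k: [list(names[i::k]) for i in range(k)],
    score=lambda groups: [{n: float(len(n)) for n in g} for g in groups],
    build=lambda scores, k: {"K": k, "n_groups": len(scores)},
)
result = system.run(["X1", "X2", "X3", "X4"], k=2)
```

Keeping each module behind a plain callable makes it easy to swap in a different scoring rule or model type without touching the other steps.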
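In a preferred embodiment, as shown in FIG. 7 and on the basis of the embodiment of FIG. 6, the calculation module 102 includes:
a first calculation unit 1021, configured to calculate the mean of the candidate indicators in each indicator group, obtain a group-center set from the means, and calculate the distance between each candidate indicator and the group-center set, the calculated distance being taken as the intra-group distance D1;
a second calculation unit 1022, configured to calculate the center distances between the indicator group containing each candidate indicator and every other indicator group, obtain the indicator group with the smallest center distance, and calculate the inter-group distance D2 from the obtained indicator group;
a third calculation unit 1023, configured to calculate the screening evaluation value A: A=(1-D1)/(1-D2).
In this embodiment, suppose there are five candidate indicator variables X1, X2, X3, X4 and X5, where Xi=(Xi1,Xi2,…,Xin) and n=10, as shown in Table 1 above.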
The group center of these five candidate indicator variables is the mean of each component across the five variables:
M=(m1,m2,…,mn), where
mj=(X1j+X2j+X3j+X4j+X5j)/5, j=1,2,…,n.
Here m1=(-0.02106-0.02075-0.00183-0.2542+0.517368)/5=0.043906 and m2=(-0.02106-0.02075-0.00183+0.305505+0.367093)/5=0.125792; the center of the five candidate indicator variables (i.e. the group-center set) is therefore:
M=(0.043906,0.125792,-1.22513,0.110018,-1.12762,-0.85941,-1.13551,0.871851,-0.8972,-0.45401).
From the above, the distance between candidate indicator variable X1 and the group center is:
Figure PCTCN2017091374-appb-000008
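where the symbol shown in Figure PCTCN2017091374-appb-000009 is the mean of candidate indicator variable X1, the symbol shown in Figure PCTCN2017091374-appb-000010 is the mean of the group-center set M, and n is the number of samples (the number of indicator groups). The mean of X1 is calculated to be -0.73831 and the mean of M to be -0.45473. This distance is the intra-group distance D1 of candidate indicator variable X1. The intra-group distance D1 of every other candidate indicator variable can be calculated in the same way.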
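When calculating the inter-group distance, first calculate the distance between the center of the indicator group containing the candidate indicator variable and the centers of the other indicator groups: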
Figure PCTCN2017091374-appb-000011
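Here mpi are the components of the center MP of the given indicator group, and mqi are the components of the center MQ of another indicator group.
From the distances d above, find the center of the indicator group nearest to the candidate indicator variable's group, and then calculate the inter-group distance of the candidate indicator variable using the formula for the distance between a candidate indicator variable and a group center: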
Figure PCTCN2017091374-appb-000012
Finally, calculate the screening evaluation value A: A=(1-D1)/(1-D2). Alternatively, the screening evaluation value A can be calculated as A=(1-D2)/(1-D1).
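In a preferred embodiment, as shown in FIG. 8 and on the basis of the embodiment of FIG. 6, the establishing module 103 includes:
a selection unit 1031, configured to select, in each indicator group, at least one candidate indicator with the largest screening evaluation value and at least one candidate indicator with the smallest screening evaluation value;
a first establishing unit 1032, configured to build a predetermined indicator model using the candidate indicators selected from the indicator groups if the value of K is greater than or equal to a preset threshold;
a second establishing unit 1033, configured to, if the value of K is less than the preset threshold, increase the value of K, recalculate the screening evaluation values and reselect candidate indicators, so as to build another predetermined indicator model using the candidate indicators selected from the indicator groups.
In this embodiment, at least one candidate indicator with the largest screening evaluation value and at least one with the smallest can be selected for each indicator group, so that the correlation among the selected candidate indicators is weakest. When the correlation among the selected candidate indicators is weakest, the selected candidate indicators are the most representative or most effective indicators.
In this embodiment, if K is greater than or equal to the preset threshold (for example, a preset threshold of 15), a predetermined indicator model is built using the candidate indicators selected from the indicator groups; if K is less than the preset threshold, K is increased by 1, the candidate indicators are divided equally again into (K+1) indicator groups, the corresponding intra-group distance D1, inter-group distance D2 and screening evaluation value A are calculated, and candidate indicators are selected according to A to build another predetermined model.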
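In a preferred embodiment, on the basis of the embodiment of FIG. 6, the data mining-based modeling system further includes a verification module, configured to verify the built indicator models using predetermined validation data samples and to apply the indicator model with the highest accuracy after verification as the baseline model.
In this embodiment, after the models are built, their accuracy can be verified. For example, predetermined validation data samples can be used to verify each built model and determine its accuracy, after which the model with the highest accuracy is applied as the baseline model.
Preferably, the verification module is specifically configured to apply the indicator model with the highest accuracy as the baseline model if exactly one indicator model has the highest accuracy; and, if more than one indicator model has the highest accuracy, to select one of them at random as the baseline model, or to increase the number of validation data samples until exactly one indicator model has the highest accuracy and apply that indicator model as the baseline model.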
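The above are merely preferred embodiments of the present invention and are not intended to limit it. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.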

Claims (20)

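  1. A data mining-based modeling method, wherein the data mining-based modeling method includes:
    S1: after receiving the candidate indicators to be screened, dividing the candidate indicators equally into K indicator groups;
    S2: calculating an intra-group distance D1 and an inter-group distance D2 for each candidate indicator in each indicator group, and calculating a screening evaluation value A for each candidate indicator from the intra-group distance D1 and the inter-group distance D2 according to a predetermined calculation rule;
    S3: selecting candidate indicators according to the screening evaluation value A, and building an indicator model based on the value of K using the selected candidate indicators.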
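  2. The data mining-based modeling method according to claim 1, wherein step S2 includes:
    S21: calculating the mean of the candidate indicators in each indicator group, obtaining a group-center set from the means, and calculating the distance between each candidate indicator and the group-center set, the calculated distance being taken as the intra-group distance D1;
    S22: calculating the center distances between the indicator group containing each candidate indicator and every other indicator group, obtaining the indicator group with the smallest center distance, and calculating the inter-group distance D2 from the obtained indicator group;
    S23: calculating the screening evaluation value A: A=(1-D1)/(1-D2).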
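  3. The data mining-based modeling method according to claim 2, wherein step S3 includes:
    S31: in each indicator group, selecting at least one candidate indicator with the largest screening evaluation value and at least one candidate indicator with the smallest screening evaluation value;
    S32: if the value of K is greater than or equal to a preset threshold, building a predetermined indicator model using the candidate indicators selected from the indicator groups;
    S33: if the value of K is less than the preset threshold, increasing the value of K, recalculating the screening evaluation values and performing step S31 again, so as to build another predetermined indicator model using the candidate indicators selected from the indicator groups.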
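  4. The data mining-based modeling method according to any one of claims 1 to 3, wherein the method further includes, after step S3:
    S4: verifying the built indicator models using predetermined validation data samples, and applying the indicator model with the highest accuracy after verification as the baseline model.
  5. The data mining-based modeling method according to claim 4, wherein step S4 includes:
    if exactly one indicator model has the highest accuracy, applying that indicator model as the baseline model;
    if more than one indicator model has the highest accuracy, randomly selecting one of them as the baseline model, or increasing the number of validation data samples until exactly one indicator model has the highest accuracy, and applying that indicator model as the baseline model.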
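  6. A data mining-based modeling system, wherein the data mining-based modeling system includes:
    a dividing module, configured to divide the candidate indicators equally into K indicator groups after the candidate indicators to be screened are received;
    a calculation module, configured to calculate an intra-group distance D1 and an inter-group distance D2 for each candidate indicator in each indicator group, and to calculate a screening evaluation value A for each candidate indicator from the intra-group distance D1 and the inter-group distance D2 according to a predetermined calculation rule;
    an establishing module, configured to select candidate indicators according to the screening evaluation value A, and to build an indicator model based on the value of K using the selected candidate indicators.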
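  7. The data mining-based modeling system according to claim 6, wherein the calculation module includes:
    a first calculation unit, configured to calculate the mean of the candidate indicators in each indicator group, obtain a group-center set from the means, and calculate the distance between each candidate indicator and the group-center set, the calculated distance being taken as the intra-group distance D1;
    a second calculation unit, configured to calculate the center distances between the indicator group containing each candidate indicator and every other indicator group, obtain the indicator group with the smallest center distance, and calculate the inter-group distance D2 from the obtained indicator group;
    a third calculation unit, configured to calculate the screening evaluation value A: A=(1-D1)/(1-D2).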
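  8. The data mining-based modeling system according to claim 7, wherein the establishing module includes:
    a selection unit, configured to select, in each indicator group, at least one candidate indicator with the largest screening evaluation value and at least one candidate indicator with the smallest screening evaluation value;
    a first establishing unit, configured to build a predetermined indicator model using the candidate indicators selected from the indicator groups if the value of K is greater than or equal to a preset threshold;
    a second establishing unit, configured to, if the value of K is less than the preset threshold, increase the value of K, recalculate the screening evaluation values and reselect candidate indicators, so as to build another predetermined indicator model using the candidate indicators selected from the indicator groups.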
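  9. The data mining-based modeling system according to any one of claims 6 to 8, wherein the data mining-based modeling system further includes a verification module, configured to verify the built indicator models using predetermined validation data samples and to apply the indicator model with the highest accuracy after verification as the baseline model.
  10. The data mining-based modeling system according to claim 9, wherein the verification module is specifically configured to apply the indicator model with the highest accuracy as the baseline model if exactly one indicator model has the highest accuracy; and, if more than one indicator model has the highest accuracy, to select one of them at random as the baseline model, or to increase the number of validation data samples until exactly one indicator model has the highest accuracy and apply that indicator model as the baseline model.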
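  11. An electronic device, wherein the electronic device includes a memory and a processor connected to the memory, the memory stores a data mining-based modeling system runnable on the processor, and the data mining-based modeling system, when executed by the processor, implements the following steps:
    S1: after receiving the candidate indicators to be screened, dividing the candidate indicators equally into K indicator groups;
    S2: calculating an intra-group distance D1 and an inter-group distance D2 for each candidate indicator in each indicator group, and calculating a screening evaluation value A for each candidate indicator from the intra-group distance D1 and the inter-group distance D2 according to a predetermined calculation rule;
    S3: selecting candidate indicators according to the screening evaluation value A, and building an indicator model based on the value of K using the selected candidate indicators.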
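  12. The electronic device according to claim 11, wherein step S2 includes:
    S21: calculating the mean of the candidate indicators in each indicator group, obtaining a group-center set from the means, and calculating the distance between each candidate indicator and the group-center set, the calculated distance being taken as the intra-group distance D1;
    S22: calculating the center distances between the indicator group containing each candidate indicator and every other indicator group, obtaining the indicator group with the smallest center distance, and calculating the inter-group distance D2 from the obtained indicator group;
    S23: calculating the screening evaluation value A: A=(1-D1)/(1-D2).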
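  13. The electronic device according to claim 12, wherein step S3 includes:
    S31: in each indicator group, selecting at least one candidate indicator with the largest screening evaluation value and at least one candidate indicator with the smallest screening evaluation value;
    S32: if the value of K is greater than or equal to a preset threshold, building a predetermined indicator model using the candidate indicators selected from the indicator groups;
    S33: if the value of K is less than the preset threshold, increasing the value of K, recalculating the screening evaluation values and performing step S31 again, so as to build another predetermined indicator model using the candidate indicators selected from the indicator groups.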
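  14. The electronic device according to any one of claims 11 to 13, wherein the data mining-based modeling system, when executed by the processor, further implements the following step:
    S4: verifying the built indicator models using predetermined validation data samples, and applying the indicator model with the highest accuracy after verification as the baseline model.
  15. The electronic device according to claim 14, wherein step S4 includes:
    if exactly one indicator model has the highest accuracy, applying that indicator model as the baseline model;
    if more than one indicator model has the highest accuracy, randomly selecting one of them as the baseline model, or increasing the number of validation data samples until exactly one indicator model has the highest accuracy, and applying that indicator model as the baseline model.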
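  16. A computer-readable storage medium, wherein a data mining-based modeling system is stored on the computer-readable storage medium, and the data mining-based modeling system, when executed by a processor, implements the following steps:
    S1: after receiving the candidate indicators to be screened, dividing the candidate indicators equally into K indicator groups;
    S2: calculating an intra-group distance D1 and an inter-group distance D2 for each candidate indicator in each indicator group, and calculating a screening evaluation value A for each candidate indicator from the intra-group distance D1 and the inter-group distance D2 according to a predetermined calculation rule;
    S3: selecting candidate indicators according to the screening evaluation value A, and building an indicator model based on the value of K using the selected candidate indicators.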
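  17. The computer-readable storage medium according to claim 16, wherein step S2 includes:
    S21: calculating the mean of the candidate indicators in each indicator group, obtaining a group-center set from the means, and calculating the distance between each candidate indicator and the group-center set, the calculated distance being taken as the intra-group distance D1;
    S22: calculating the center distances between the indicator group containing each candidate indicator and every other indicator group, obtaining the indicator group with the smallest center distance, and calculating the inter-group distance D2 from the obtained indicator group;
    S23: calculating the screening evaluation value A: A=(1-D1)/(1-D2).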
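  18. The computer-readable storage medium according to claim 17, wherein step S3 includes:
    S31: in each indicator group, selecting at least one candidate indicator with the largest screening evaluation value and at least one candidate indicator with the smallest screening evaluation value;
    S32: if the value of K is greater than or equal to a preset threshold, building a predetermined indicator model using the candidate indicators selected from the indicator groups;
    S33: if the value of K is less than the preset threshold, increasing the value of K, recalculating the screening evaluation values and performing step S31 again, so as to build another predetermined indicator model using the candidate indicators selected from the indicator groups.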
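  19. The computer-readable storage medium according to any one of claims 16 to 18, wherein the data mining-based modeling system, when executed by the processor, further implements the following step:
    S4: verifying the built indicator models using predetermined validation data samples, and applying the indicator model with the highest accuracy after verification as the baseline model.
  20. The computer-readable storage medium according to claim 19, wherein step S4 includes:
    if exactly one indicator model has the highest accuracy, applying that indicator model as the baseline model;
    if more than one indicator model has the highest accuracy, randomly selecting one of them as the baseline model, or increasing the number of validation data samples until exactly one indicator model has the highest accuracy, and applying that indicator model as the baseline model.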
PCT/CN2017/091374 2016-12-30 2017-06-30 Data mining-based modeling method, system, electronic device and storage medium WO2018120726A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201611263812.0 2016-12-30
CN201611263812.0A CN106874933A (zh) 2016-12-30 2016-12-30 Data mining-based modeling method and apparatus

Publications (1)

Publication Number Publication Date
WO2018120726A1 true WO2018120726A1 (zh) 2018-07-05

Family

ID=59164643

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2017/091374 WO2018120726A1 (zh) 2016-12-30 2017-06-30 Data mining-based modeling method, system, electronic device and storage medium

Country Status (2)

Country Link
CN (1) CN106874933A (zh)
WO (1) WO2018120726A1 (zh)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
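CN106874933A (zh) * 2016-12-30 2017-06-20 平安科技(深圳)有限公司 Data mining-based modeling method and apparatus
CN108647720A (zh) * 2018-05-10 2018-10-12 上海扩博智能技术有限公司 Iterative loop recognition method, system, device and storage medium for commodity images
CN110399262B (zh) * 2019-06-17 2022-09-27 平安科技(深圳)有限公司 Operation and maintenance monitoring alarm convergence method, apparatus, computer device and storage medium
CN113723831A (zh) * 2021-09-02 2021-11-30 深圳前海微众银行股份有限公司 Modeling complexity adjustment method, apparatus, device and computer-readable storage medium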

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
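CN101806396A (zh) * 2010-04-24 2010-08-18 上海交通大学 Method for generating a pressure distribution map of an urban water supply pipe network
CN103208039A (zh) * 2012-01-13 2013-07-17 株式会社日立制作所 Software project risk evaluation method and apparatus
CN103942604A (zh) * 2013-01-18 2014-07-23 上海安迪泰信息技术有限公司 Prediction method and system based on a forest discrimination model
CN103729550A (zh) * 2013-12-18 2014-04-16 河海大学 Multi-model integrated flood forecasting method based on propagation-time cluster analysis
CN106874933A (zh) * 2016-12-30 2017-06-20 平安科技(深圳)有限公司 Data mining-based modeling method and apparatus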

Also Published As

Publication number Publication date
CN106874933A (zh) 2017-06-20

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 17888178

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 17888178

Country of ref document: EP

Kind code of ref document: A1

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS (EPO FORM 1205A DATED 04.10.2019)

122 Ep: pct application non-entry in european phase

Ref document number: 17888178

Country of ref document: EP

Kind code of ref document: A1