WO2018120726A1 - Data mining-based modeling method, system, electronic device, and storage medium
- Publication number
- WO2018120726A1 (PCT/CN2017/091374)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- indicator
- group
- candidate
- model
- distance
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/211—Selection of the most significant subset of features
- G06F18/2113—Selection of the most significant subset of features by ranking or filtering the set of features, e.g. using a measure of variance or of feature cross-correlation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/22—Matching criteria, e.g. proximity measures
Definitions
- The present invention relates to the field of data mining technologies, and in particular to a data mining-based modeling method, system, electronic device, and computer-readable storage medium.
- When a model is built, a large number of candidate modeling indicators is usually collected, sometimes 200 or more, but usually only some of them are effective for modeling; for example, only 30 of 200 candidate modeling indicators may be valid.
- The existing approach is to manually select highly correlated indicators for modeling. Because manual selection is subjective, the indicators that are effective for modeling cannot be selected accurately, and modeling efficiency is low.
- The object of the present invention is to provide a data mining-based modeling method, system, electronic device, and computer-readable storage medium that accurately select the candidate indicators with the weakest correlation and improve modeling efficiency.
- The present invention provides a data mining-based modeling method, which includes:
- The present invention also provides a data mining-based modeling apparatus, comprising:
- an equalization module configured to divide the candidate indicators equally into K indicator groups after receiving the candidate indicators to be screened;
- a calculation module configured to calculate an intra-group distance D1 and an inter-group distance D2 of each candidate indicator in each indicator group, and to calculate the screening evaluation value A corresponding to each candidate indicator according to the intra-group distance D1 and the inter-group distance D2 and based on a predetermined calculation rule;
- an establishing module configured to select candidate indicators according to the screening evaluation value A, and to establish an indicator model based on the K value and using the selected candidate indicators.
- The present invention also provides an electronic device including a memory and a processor connected to the memory, the memory storing a data mining-based modeling system operable on the processor. When executed by the processor, the data mining-based modeling system implements the following steps:
- The present invention also provides a computer-readable storage medium storing a data mining-based modeling system which, when executed by a processor, implements the following steps:
- The beneficial effects of the present invention are as follows: after the candidate indicators are divided into a plurality of indicator groups, the intra-group distance D1 and the inter-group distance D2 of each candidate indicator in each indicator group are calculated first, and the screening evaluation value A is then calculated from D1 and D2. Because the screening evaluation value A takes both the intra-group distance D1 and the inter-group distance D2 of a candidate indicator into account, the candidate indicators with the weakest correlation can be selected according to A. The selected indicators are therefore the most representative or most effective ones; no manual selection is needed, the selection is accurate, and modeling efficiency is high.
- FIG. 1 is a schematic flowchart of a first embodiment of the data mining-based modeling method of the present invention;
- FIG. 2 is a schematic diagram of the refined flow of step S2 shown in FIG. 1;
- FIG. 3 is a schematic diagram of the refined flow of step S3 shown in FIG. 1;
- FIG. 4 is a schematic flowchart of a second embodiment of the data mining-based modeling method of the present invention;
- FIG. 5 is a schematic diagram of an application environment of an embodiment of the data mining-based modeling method of the present invention;
- FIG. 6 is a schematic structural diagram of an embodiment of the data mining-based modeling system of the present invention;
- FIG. 7 is a schematic structural diagram of the calculation module shown in FIG. 6;
- FIG. 8 is a schematic structural diagram of the establishing module shown in FIG. 6.
- FIG. 1 is a schematic flowchart of an embodiment of the data mining-based modeling method of the present invention.
- The data mining-based modeling method is applied to an electronic device and includes the following steps:
- Step S1: after receiving the candidate indicators to be screened, divide the candidate indicators equally into K indicator groups.
- This embodiment randomly and equally divides the candidate indicators into K indicator groups so that cluster analysis can be performed on them.
- K is a natural number greater than 1. For example, if there are 150 candidate indicators and K is 10, the indicators are randomly divided into 10 indicator groups of 15 candidate indicators each.
- The 150 candidate indicators may themselves result from a preliminary screening: for example, from 200 initial candidate indicators, 150 can be preliminarily selected by stepwise regression (forward and backward selection) with suitable parameter settings.
- The candidate indicators include demographic characteristics, life stage characteristics, customer value information, product holding status, insurance behavior habits, historical claims related information, and the like.
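The random, equal division of step S1 can be sketched as follows. This is only an illustration under assumptions: the function name `split_into_groups`, the `seed` parameter, and the round-robin dealing of a shuffled list are not taken from the patent, which states only that the candidate indicators are randomly and equally divided into K groups.

```python
import random


def split_into_groups(indicators, k, seed=None):
    """Randomly split indicator names into k groups of (near-)equal size."""
    if k < 2:
        raise ValueError("K must be a natural number greater than 1")
    shuffled = list(indicators)
    random.Random(seed).shuffle(shuffled)
    # Deal the shuffled names round-robin so group sizes differ by at most one.
    return [shuffled[i::k] for i in range(k)]


# 150 hypothetical indicator names split into K = 10 groups of 15 each.
groups = split_into_groups([f"X{i}" for i in range(1, 151)], k=10, seed=0)
```

With 150 indicators and K = 10 every group receives exactly 15 indicators, which matches the example in the text; when the count is not divisible by K, this sketch lets group sizes differ by one.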
- Step S2: calculate an intra-group distance D1 and an inter-group distance D2 for each candidate indicator in each indicator group, and calculate the screening evaluation value A corresponding to each candidate indicator according to the intra-group distance D1 and the inter-group distance D2 and based on a predetermined calculation rule.
- The intra-group distance D1 is the correlation coefficient between the candidate indicator variable and the group center set. The larger the intra-group distance D1, the stronger the correlation between the candidate indicator and the group center set.
- The group center set is determined by the mean of the candidate indicators in each indicator group.
- The inter-group distance D2 is the correlation coefficient between the candidate indicator variable and the center of the group nearest to the indicator's own group. The smaller the inter-group distance D2, the weaker the correlation between the candidate indicator and the center of that nearest group.
- The screening evaluation value A is calculated from the intra-group distance D1 and the inter-group distance D2 of each candidate indicator. Because both distances are considered at the same time, the calculated screening evaluation value A is comprehensive and purposeful.
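The claims of this document give the calculation rule as A = (1 - D1) / (1 - D2). A minimal sketch of that rule follows; the function name `screening_value`, the explicit guard for D2 = 1, and the input values 0.9 and 0.5 are illustrative assumptions, not part of the patent.

```python
def screening_value(d1, d2):
    """Screening evaluation value A = (1 - D1) / (1 - D2)."""
    if d2 == 1:
        # The ratio is undefined when the indicator is perfectly
        # correlated with the nearest other group center.
        raise ZeroDivisionError("A is undefined when D2 equals 1")
    return (1 - d1) / (1 - d2)


# Hypothetical values: strong correlation with the indicator's own group
# center (D1) and moderate correlation with the nearest other center (D2).
a = screening_value(0.9, 0.5)
```

Under this rule, A shrinks as D1 approaches 1 and grows as D2 approaches 1, so a small A marks an indicator that follows its own group closely and the neighboring group only loosely.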
- Step S3: select candidate indicators according to the screening evaluation value A, and establish an indicator model based on the K value and using the selected candidate indicators.
- The candidate indicators with the weakest correlation may be selected, for example by selecting the candidate indicator with the largest screening evaluation value A and the candidate indicator with the smallest screening evaluation value A, or by selecting the 10 candidate indicators with the largest A and the 10 candidate indicators with the smallest A.
- The established model may be, for example, a logistic regression model, a decision tree model, or a neural network model.
- The model is established according to the number K of indicator groups. For example, when K is small, one model or type of model can be established; when K is greater than a certain threshold, another model or type of model can be established. That is, the model to be established is determined mainly by the number of indicator groups.
- Compared with the prior art, after dividing the candidate indicators into a plurality of indicator groups, this embodiment first calculates the intra-group distance D1 and the inter-group distance D2 of each candidate indicator in each indicator group, and then calculates the screening evaluation value A from D1 and D2. Because the screening evaluation value A takes both D1 and D2 into account, the candidate indicators with the weakest correlation can be selected according to A. The selected candidate indicators are therefore the most representative or most effective ones; no manual selection is needed, the selection is accurate, and modeling efficiency is high.
- Step S2 includes:
- The group center of a group made up of five candidate indicator variables X1 to X5 is the component-wise mean of the five variables:
- M = (0.043906, 0.125792, -1.22513, 0.110018, -1.12762, -0.85941, -1.13551, 0.871851, -0.8972, -0.45401).
- The distance between the candidate indicator variable X1 and the group center M is then calculated from the mean of X1, the mean of the group center set M, and n, the number of samples (the number of indicator groups).
- The mean of X1 is calculated as -0.73831, and the mean of M as -0.45473.
- This distance is the intra-group distance D1 of the candidate indicator variable X1.
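The worked example above can be reproduced with a short sketch. It assumes that the "correlation coefficient" used for D1 is the Pearson correlation coefficient; the original formula did not survive extraction, so this choice is an assumption. The rows are the sample values of X1 to X5 from the table in this document, and the helper name `pearson` is illustrative.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Each row is one sample; the columns are X1..X5.
rows = [
    (-0.02106, -0.02075, -0.00183, -0.2542, 0.517368),
    (-0.02106, -0.02075, -0.00183, 0.305505, 0.367093),
    (-1.54935, -1.54959, -1.49993, -1.00909, -0.51768),
    (-0.02106, -0.02075, 0.316522, 0.305505, -0.03013),
    (-1.54935, -1.54959, -1.49993, -1.00909, -0.03013),
    (-1.54935, -1.54959, -1.49993, -0.2542, 0.556034),
    (-1.54935, -1.54959, -1.49993, -0.2542, -0.8245),
    (0.936479, 0.937007, 0.909081, 1.020655, 0.556034),
    (-1.54935, -1.54959, -1.49993, -0.2542, 0.367093),
    (-0.50968, -0.50945, -0.47902, -0.2542, -0.51768),
]

# Group center M: component-wise mean of the five indicator variables.
center = [sum(row) / len(row) for row in rows]
x1 = [row[0] for row in rows]

mean_x1 = sum(x1) / len(x1)              # mean of X1
mean_center = sum(center) / len(center)  # mean of M
d1 = pearson(x1, center)                 # intra-group distance D1 of X1 (assumed Pearson)
```

Computing the center from the table rather than copying it keeps the means consistent with the values quoted in the text.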
- In the same way, the intra-group distance D1 of every candidate indicator variable can be calculated.
- m_pi denotes the components of the center M_P of each indicator group;
- m_qi denotes the components of the center M_Q of the other indicator groups.
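The inter-group part can be sketched in the same spirit. Two assumptions are made here and are not confirmed by the surviving text: the center distance between groups P and Q is taken to be the Euclidean distance over the components m_pi and m_qi, and D2 is taken to be the Pearson correlation between the indicator and the center of the nearest other group. The group names, center values, and indicator values are made up for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def center_distance(m_p, m_q):
    """Assumed center distance: Euclidean distance between two group centers."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(m_p, m_q)))


def inter_group_distance(indicator, own_group, centers):
    """Return the nearest other group and D2, the correlation with its center."""
    others = [name for name in centers if name != own_group]
    nearest = min(others, key=lambda g: center_distance(centers[own_group], centers[g]))
    return nearest, pearson(indicator, centers[nearest])


# Hypothetical centers of three indicator groups and one indicator from group P.
centers = {
    "P": [0.0, 1.0, 2.0, 3.0],
    "Q": [0.5, 1.0, 2.5, 2.0],
    "R": [9.0, 7.0, 8.0, 6.0],
}
x = [0.1, 0.9, 2.2, 3.1]

nearest, d2 = inter_group_distance(x, "P", centers)
d1 = pearson(x, centers["P"])
a = (1 - d1) / (1 - d2)  # screening evaluation value from the claims
```

In this toy data the indicator tracks its own center P more closely than the nearest center Q, so D1 exceeds D2 and A comes out well below 1.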
- The foregoing step S3 includes:
- S31: in each indicator group, select at least one candidate indicator corresponding to the maximum screening evaluation value and at least one candidate indicator corresponding to the minimum screening evaluation value;
- S32: if the K value is greater than or equal to a preset threshold, use the candidate indicators selected from each indicator group to establish a predetermined indicator model;
- S33: if the K value is less than the preset threshold, increase the K value, recalculate the screening evaluation value, and perform step S31 again, so as to establish another predetermined indicator model using the candidate indicators selected from each indicator group.
- At least one candidate indicator with the largest screening evaluation value and at least one candidate indicator with the smallest screening evaluation value may be selected from each indicator group, so that the correlation between the selected candidate indicators is the weakest. When the correlation between the selected candidate indicators is the weakest, the selected candidate indicators are the most representative or most effective indicators.
- If K is greater than or equal to the preset threshold (for example, 15), the candidate indicators selected from each indicator group are used to establish a predetermined indicator model. If K is less than the preset threshold, K is increased by 1, the candidate indicators are re-divided into (K+1) indicator groups, the corresponding intra-group distance D1, inter-group distance D2, and screening evaluation value A are recalculated, and the candidate indicators are selected according to the screening evaluation value A.
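Steps S31 to S33 can be sketched as a loop. This is one reading of the text, stated as an assumption: K is raised by 1 until it reaches the preset threshold, the grouping and scores are recomputed each time, and a model is fitted once on the final selection. The callables `regroup_and_score` and `fit_model` are hypothetical placeholders for the grouping/scoring of steps S1 and S2 and for the model-building routine; the toy scoring below exists only so the sketch runs.

```python
def select_per_group(groups, scores):
    """S31: from each group keep the indicators with the largest and smallest A."""
    selected = []
    for group in groups:
        if not group:
            continue
        ranked = sorted(group, key=lambda name: scores[name])
        selected.append(ranked[-1])      # largest screening evaluation value
        if len(ranked) > 1:
            selected.append(ranked[0])   # smallest screening evaluation value
    return selected


def run_step_s3(indicators, k, threshold, regroup_and_score, fit_model):
    """Raise K until it reaches the threshold (S33), then fit one model (S32)."""
    groups, scores = regroup_and_score(indicators, k)
    selected = select_per_group(groups, scores)
    while k < threshold:
        k += 1
        groups, scores = regroup_and_score(indicators, k)
        selected = select_per_group(groups, scores)
    return k, fit_model(selected)


def toy_regroup_and_score(names, k):
    """Hypothetical stand-in: deterministic grouping, score = indicator number."""
    groups = [names[i::k] for i in range(k)]
    scores = {name: int(name[1:]) for name in names}
    return groups, scores


names = [f"X{i}" for i in range(1, 61)]
# `tuple` stands in for a real model-fitting routine.
final_k, model = run_step_s3(names, 13, 15, toy_regroup_and_score, tuple)
```

Starting from K = 13 with a threshold of 15, the loop regroups twice and ends with two indicators from each of the 15 groups.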
- In the second embodiment, after step S3 the method further includes:
- S4: verify the established indicator models using predetermined verification data samples, and apply the indicator model with the highest accuracy after verification as the reference model.
- After the models are established, their accuracy can be verified: the established models are verified with predetermined verification data samples to determine the accuracy of each model, and the model with the highest accuracy is then used as the reference model.
- If the number of indicator models with the highest accuracy is 1, that indicator model is applied as the reference model;
- If the number of indicator models with the highest accuracy is greater than 1, one of them is randomly selected as the reference model, or the number of verification data samples is increased until the number of indicator models with the highest accuracy is 1, and that indicator model is applied as the reference model.
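The tie handling of step S4 can be sketched as follows. The text offers two alternatives for a tie (random choice, or adding verification samples until one model leads); this sketch combines them as an assumption, adding samples while any remain and falling back to a random choice afterwards. Models are represented as plain callables, and all names and sample values are hypothetical.

```python
import random


def accuracy(model, samples):
    """Fraction of (features, label) samples that the model predicts correctly."""
    hits = sum(1 for features, label in samples if model(features) == label)
    return hits / len(samples)


def pick_reference_model(models, samples, extra_samples=(), rng=None):
    """Return the name of the most accurate model, resolving ties as in S4."""
    rng = rng or random.Random()
    samples = list(samples)
    extra = list(extra_samples)
    while True:
        scores = {name: accuracy(m, samples) for name, m in models.items()}
        best = max(scores.values())
        winners = [name for name, score in scores.items() if score == best]
        if len(winners) == 1:
            return winners[0]
        if not extra:
            return rng.choice(winners)   # no more data: pick one at random
        samples.append(extra.pop(0))     # add a verification sample and retry


# Two toy models that tie on the first two samples.
models = {
    "always_one": lambda x: 1,
    "sign": lambda x: int(x > 0),
}
reference = pick_reference_model(
    models, samples=[(1, 1), (2, 1)], extra_samples=[(-1, 0)]
)
```

Here both models score 100% on the initial samples; the extra sample breaks the tie in favor of the model that also classifies the negative input correctly.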
- FIG. 5 is a schematic diagram of an application environment of a preferred embodiment of the data mining-based modeling method of the present invention.
- The application environment includes an electronic device 1 and a terminal device 2.
- The electronic device 1 can exchange data with the terminal device 2 through a suitable technology such as a network or near field communication.
- The terminal device 2 includes, but is not limited to, any electronic product that can interact with a user through a keyboard, a mouse, a remote controller, a touch panel, or a voice control device, for example a mobile device such as a personal computer, a tablet computer, a smart phone, a personal digital assistant (PDA), a game console, an Internet Protocol Television (IPTV), a smart wearable device, or a navigation device, or a fixed terminal such as a digital TV, a desktop computer, a notebook, or a server.
- The electronic device 1 is a device capable of automatically performing numerical calculation and/or information processing according to instructions that are set or stored in advance.
- The electronic device 1 may be a computer, a single network server, a server group composed of multiple network servers, or a cloud composed of a large number of hosts or network servers based on cloud computing, where cloud computing is a type of distributed computing: a super virtual computer consisting of a group of loosely coupled computers.
- The electronic device 1 may include, but is not limited to, a memory 11, a processor 12, and a network interface 13 that are communicably connected to each other through a system bus, and the memory 11 stores a data mining-based modeling system that can run on the processor 12. Note that FIG. 5 only shows the electronic device 1 with the components 11-13; not all illustrated components are required, and more or fewer components may be implemented instead.
- The storage device 11 includes a memory and at least one type of readable storage medium.
- The memory provides a cache for the operation of the electronic device 1;
- The readable storage medium may be, for example, a flash memory, a hard disk, a multimedia card, a card type memory (e.g., SD or DX memory), a random access memory (RAM), a static random access memory (SRAM), a read only memory (ROM), an electrically erasable programmable read only memory (EEPROM), a programmable read only memory (PROM), a magnetic memory, a magnetic disk, an optical disk, or the like.
- In some embodiments, the readable storage medium may be an internal storage unit of the electronic device 1, such as a hard disk of the electronic device 1; in other embodiments, the readable storage medium may be an external storage device of the electronic device 1, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card equipped on the electronic device 1.
- The readable storage medium of the storage device 11 is generally used to store the operating system installed on the electronic device 1 and various types of application software, such as the program code of the data mining-based modeling system in an embodiment of the present invention. The storage device 11 can also be used to temporarily store various types of data that have been output or are to be output.
- In some embodiments, the processor 12 may be a Central Processing Unit (CPU), a controller, a microcontroller, a microprocessor, or another data processing chip.
- The processor 12 is typically used to control the overall operation of the electronic device 1, such as performing control and processing related to data interaction or communication with the terminal device 2.
- The processor 12 is configured to run program code or process data stored in the memory 11, for example to run the data mining-based modeling system.
- The network interface 13 may comprise a wireless network interface or a wired network interface and is typically used to establish a communication connection between the electronic device 1 and other electronic devices.
- In this embodiment, the network interface 13 is mainly used to connect the electronic device 1 with one or more terminal devices 2 and to establish a data transmission channel and a communication connection between them.
- The data mining-based modeling system is stored in the memory 11 and includes at least one computer-readable instruction stored in the memory 11. The at least one computer-readable instruction can be executed by the processor 12 to implement the data mining-based modeling method of the embodiments of the present invention; as described later, the at least one computer-readable instruction can be divided into different logic modules according to the functions implemented by its parts.
- After the candidate indicators to be screened are received, the candidate indicators are equally divided into K indicator groups; the intra-group distance D1 and the inter-group distance D2 of each candidate indicator in each indicator group are calculated, and the screening evaluation value A corresponding to each candidate indicator is calculated according to D1 and D2 and based on a predetermined calculation rule; candidate indicators are then selected according to the screening evaluation value A, and an indicator model is established based on the K value and using the selected candidate indicators. In this way the candidate indicators with the weakest correlation can be accurately selected and modeling efficiency improved.
- FIG. 6 is a schematic structural diagram of an embodiment of the data mining-based modeling system of the present invention. The data mining-based modeling system runs in an electronic device and can be divided into multiple functional modules according to their different functions.
- The data mining-based modeling system includes:
- The equalization module 101, configured to divide the candidate indicators equally into K indicator groups after receiving the candidate indicators to be screened;
- This embodiment randomly and equally divides the candidate indicators into K indicator groups so that cluster analysis can be performed on them.
- K is a natural number greater than 1. For example, if there are 150 candidate indicators and K is 10, the indicators are randomly divided into 10 indicator groups of 15 candidate indicators each.
- The candidate indicators include demographic characteristics, life stage characteristics, customer value information, product holding status, insurance behavior habits, historical claims related information, and the like.
- The calculation module 102, configured to calculate an intra-group distance D1 and an inter-group distance D2 for each candidate indicator in each indicator group, and to calculate the screening evaluation value A corresponding to each candidate indicator according to the intra-group distance D1 and the inter-group distance D2 and based on a predetermined calculation rule.
- The intra-group distance D1 is the correlation coefficient between the candidate indicator variable and the group center set. The larger the intra-group distance D1, the stronger the correlation between the candidate indicator and the group center set.
- The group center set is determined by the mean of the candidate indicators in each indicator group.
- The inter-group distance D2 is the correlation coefficient between the candidate indicator variable and the center of the group nearest to the indicator's own group. The smaller the inter-group distance D2, the weaker the correlation between the candidate indicator and the center of that nearest group.
- The screening evaluation value A is calculated from the intra-group distance D1 and the inter-group distance D2 of each candidate indicator. Because both distances are considered at the same time, the calculated screening evaluation value A is comprehensive and purposeful.
- The establishing module 103, configured to select candidate indicators according to the screening evaluation value A, and to establish an indicator model based on the K value and using the selected candidate indicators.
- The candidate indicators with the weakest correlation may be selected, for example by selecting the candidate indicator with the largest screening evaluation value A and the candidate indicator with the smallest screening evaluation value A, or by selecting the 10 candidate indicators with the largest A and the 10 candidate indicators with the smallest A.
- The established model may be, for example, a logistic regression model, a decision tree model, or a neural network model.
- The model is established according to the number K of indicator groups. For example, when K is small, one model or type of model can be established; when K is greater than a certain threshold, another model or type of model can be established. That is, the model to be established is determined mainly by the number of indicator groups.
- The calculation module 102 includes:
- a first calculating unit 1021, configured to calculate the mean of the candidate indicators in each indicator group, obtain a group center set according to the mean, calculate the distance between each candidate indicator and the group center set, and take the calculated distance as the intra-group distance D1;
- a second calculating unit 1022, configured to calculate the center distances between the indicator group in which each candidate indicator is located and the other indicator groups, obtain from the center distances the indicator group with the smallest distance, and calculate the inter-group distance D2 according to the obtained indicator group.
- The group center of a group made up of five candidate indicator variables X1 to X5 is the component-wise mean of the five variables:
- M = (0.043906, 0.125792, -1.22513, 0.110018, -1.12762, -0.85941, -1.13551, 0.871851, -0.8972, -0.45401).
- The distance between the candidate indicator variable X1 and the group center M is then calculated from the mean of X1, the mean of the group center set M, and n, the number of samples (the number of indicator groups).
- The mean of X1 is calculated as -0.73831, and the mean of M as -0.45473.
- This distance is the intra-group distance D1 of the candidate indicator variable X1.
- In the same way, the intra-group distance D1 of every candidate indicator variable can be calculated.
- m_pi denotes the components of the center M_P of each indicator group;
- m_qi denotes the components of the center M_Q of the other indicator groups.
- The establishing module 103 includes:
- a selecting unit 1031, configured to select, in each indicator group, at least one candidate indicator corresponding to the maximum screening evaluation value and at least one candidate indicator corresponding to the minimum screening evaluation value;
- a first establishing unit 1032, configured to use the candidate indicators selected from each indicator group to establish a predetermined indicator model if the K value is greater than or equal to a preset threshold;
- a second establishing unit 1033, configured to, if the K value is less than the preset threshold, increase the K value, recalculate the screening evaluation value, and reselect candidate indicators, so as to establish another predetermined indicator model using the candidate indicators selected from each indicator group.
- At least one candidate indicator with the largest screening evaluation value and at least one candidate indicator with the smallest screening evaluation value may be selected from each indicator group, so that the correlation between the selected candidate indicators is the weakest. When the correlation between the selected candidate indicators is the weakest, the selected candidate indicators are the most representative or most effective indicators.
- If K is greater than or equal to the preset threshold (for example, 15), the candidate indicators selected from each indicator group are used to establish a predetermined indicator model. If K is less than the preset threshold, K is increased by 1, the candidate indicators are re-divided into (K+1) indicator groups, the corresponding intra-group distance D1, inter-group distance D2, and screening evaluation value A are recalculated, and the candidate indicators are selected according to the screening evaluation value A.
- The data mining-based modeling system further includes a verification module, configured to verify the established indicator models using predetermined verification data samples and to apply the indicator model with the highest accuracy after verification as the reference model.
- After the models are established, their accuracy can be verified: the established models are verified with predetermined verification data samples to determine the accuracy of each model, and the model with the highest accuracy is then used as the reference model.
- The verification module is specifically configured to apply the indicator model with the highest accuracy as the reference model if the number of indicator models with the highest accuracy is 1; and, if that number is greater than 1, to randomly select one of the indicator models with the highest accuracy as the reference model, or to increase the number of verification data samples until the number of indicator models with the highest accuracy is 1 and apply that indicator model as the reference model.
Landscapes
- Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Theoretical Computer Science (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
Description
X1 | X2 | X3 | X4 | X5 |
---|---|---|---|---|
-0.02106 | -0.02075 | -0.00183 | -0.2542 | 0.517368 |
-0.02106 | -0.02075 | -0.00183 | 0.305505 | 0.367093 |
-1.54935 | -1.54959 | -1.49993 | -1.00909 | -0.51768 |
-0.02106 | -0.02075 | 0.316522 | 0.305505 | -0.03013 |
-1.54935 | -1.54959 | -1.49993 | -1.00909 | -0.03013 |
-1.54935 | -1.54959 | -1.49993 | -0.2542 | 0.556034 |
-1.54935 | -1.54959 | -1.49993 | -0.2542 | -0.8245 |
0.936479 | 0.937007 | 0.909081 | 1.020655 | 0.556034 |
-1.54935 | -1.54959 | -1.49993 | -0.2542 | 0.367093 |
-0.50968 | -0.50945 | -0.47902 | -0.2542 | -0.51768 |
Claims (20)
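- A data mining-based modeling method, characterized in that the data mining-based modeling method comprises: S1, after receiving candidate indicators to be screened, dividing the candidate indicators equally into K indicator groups; S2, calculating an intra-group distance D1 and an inter-group distance D2 of each candidate indicator in each indicator group, and calculating a screening evaluation value A corresponding to each candidate indicator according to the intra-group distance D1 and the inter-group distance D2 and based on a predetermined calculation rule; S3, selecting candidate indicators according to the screening evaluation value A, and establishing an indicator model based on the K value and using the selected candidate indicators.
- The data mining-based modeling method according to claim 1, characterized in that step S2 comprises: S21, calculating the mean of the candidate indicators in each indicator group, obtaining a group center set according to the mean, calculating the distance between each candidate indicator and the group center set according to the group center set, and taking the calculated distance as the intra-group distance D1; S22, calculating the center distances between the indicator group in which each candidate indicator is located and each of the other indicator groups, obtaining from the center distances the corresponding indicator group with the smallest distance, and calculating the inter-group distance D2 according to the obtained indicator group; S23, calculating the screening evaluation value A: A=(1-D1)/(1-D2).
- The data mining-based modeling method according to claim 2, characterized in that step S3 comprises: S31, in each indicator group, selecting at least one candidate indicator corresponding to the maximum screening evaluation value and at least one candidate indicator corresponding to the minimum screening evaluation value; S32, if the K value is greater than or equal to a preset threshold, establishing a predetermined indicator model using the candidate indicators selected from each indicator group; S33, if the K value is less than the preset threshold, increasing the K value, recalculating the screening evaluation value, and performing step S31, so as to establish another predetermined indicator model using the candidate indicators selected from each indicator group.
- The data mining-based modeling method according to any one of claims 1 to 3, characterized in that after step S3 the method further comprises: S4, verifying the established indicator models using predetermined verification data samples, and applying the indicator model with the highest accuracy after verification as a reference model.
- The data mining-based modeling method according to claim 4, characterized in that step S4 comprises: if the number of indicator models with the highest accuracy is 1, applying that indicator model as the reference model; if the number of indicator models with the highest accuracy is greater than 1, randomly selecting one indicator model with the highest accuracy and applying it as the reference model, or increasing the number of verification data samples until the number of indicator models with the highest accuracy is 1 and applying that indicator model as the reference model.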
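- A data mining-based modeling system, characterized in that the data mining-based modeling system comprises: an equalization module, configured to divide candidate indicators equally into K indicator groups after receiving the candidate indicators to be screened; a calculation module, configured to calculate an intra-group distance D1 and an inter-group distance D2 of each candidate indicator in each indicator group, and to calculate a screening evaluation value A corresponding to each candidate indicator according to the intra-group distance D1 and the inter-group distance D2 and based on a predetermined calculation rule; and an establishing module, configured to select candidate indicators according to the screening evaluation value A and to establish an indicator model based on the K value and using the selected candidate indicators.
- The data mining-based modeling system according to claim 6, characterized in that the calculation module comprises: a first calculating unit, configured to calculate the mean of the candidate indicators in each indicator group, obtain a group center set according to the mean, calculate the distance between each candidate indicator and the group center set according to the group center set, and take the calculated distance as the intra-group distance D1; a second calculating unit, configured to calculate the center distances between the indicator group in which each candidate indicator is located and each of the other indicator groups, obtain from the center distances the corresponding indicator group with the smallest distance, and calculate the inter-group distance D2 according to the obtained indicator group; and a third calculating unit, configured to calculate the screening evaluation value A: A=(1-D1)/(1-D2).
- The data mining-based modeling system according to claim 7, characterized in that the establishing module comprises: a selecting unit, configured to select, in each indicator group, at least one candidate indicator corresponding to the maximum screening evaluation value and at least one candidate indicator corresponding to the minimum screening evaluation value; a first establishing unit, configured to establish a predetermined indicator model using the candidate indicators selected from each indicator group if the K value is greater than or equal to a preset threshold; and a second establishing unit, configured to, if the K value is less than the preset threshold, increase the K value, recalculate the screening evaluation value, and reselect candidate indicators, so as to establish another predetermined indicator model using the candidate indicators selected from each indicator group.
- The data mining-based modeling system according to any one of claims 6 to 8, characterized in that the data mining-based modeling system further comprises: a verification module, configured to verify the established indicator models using predetermined verification data samples and to apply the indicator model with the highest accuracy after verification as a reference model.
- The data mining-based modeling system according to claim 9, characterized in that the verification module is specifically configured to: if the number of indicator models with the highest accuracy is 1, apply that indicator model as the reference model; if the number of indicator models with the highest accuracy is greater than 1, randomly select one indicator model with the highest accuracy and apply it as the reference model, or increase the number of verification data samples until the number of indicator models with the highest accuracy is 1 and apply that indicator model as the reference model.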
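- An electronic device, characterized in that the electronic device comprises a memory and a processor connected to the memory, the memory storing a data mining-based modeling system operable on the processor, and the data mining-based modeling system, when executed by the processor, implementing the following steps: S1, after receiving candidate indicators to be screened, dividing the candidate indicators equally into K indicator groups; S2, calculating an intra-group distance D1 and an inter-group distance D2 of each candidate indicator in each indicator group, and calculating a screening evaluation value A corresponding to each candidate indicator according to the intra-group distance D1 and the inter-group distance D2 and based on a predetermined calculation rule; S3, selecting candidate indicators according to the screening evaluation value A, and establishing an indicator model based on the K value and using the selected candidate indicators.
- The electronic device according to claim 11, characterized in that step S2 comprises: S21, calculating the mean of the candidate indicators in each indicator group, obtaining a group center set according to the mean, calculating the distance between each candidate indicator and the group center set according to the group center set, and taking the calculated distance as the intra-group distance D1; S22, calculating the center distances between the indicator group in which each candidate indicator is located and each of the other indicator groups, obtaining from the center distances the corresponding indicator group with the smallest distance, and calculating the inter-group distance D2 according to the obtained indicator group; S23, calculating the screening evaluation value A: A=(1-D1)/(1-D2).
- The electronic device according to claim 12, characterized in that step S3 comprises: S31, in each indicator group, selecting at least one candidate indicator corresponding to the maximum screening evaluation value and at least one candidate indicator corresponding to the minimum screening evaluation value; S32, if the K value is greater than or equal to a preset threshold, establishing a predetermined indicator model using the candidate indicators selected from each indicator group; S33, if the K value is less than the preset threshold, increasing the K value, recalculating the screening evaluation value, and performing step S31, so as to establish another predetermined indicator model using the candidate indicators selected from each indicator group.
- The electronic device according to any one of claims 11 to 13, characterized in that the data mining-based modeling system, when executed by the processor, further implements the following step: S4, verifying the established indicator models using predetermined verification data samples, and applying the indicator model with the highest accuracy after verification as a reference model.
- The electronic device according to claim 14, characterized in that step S4 comprises: if the number of indicator models with the highest accuracy is 1, applying that indicator model as the reference model; if the number of indicator models with the highest accuracy is greater than 1, randomly selecting one indicator model with the highest accuracy and applying it as the reference model, or increasing the number of verification data samples until the number of indicator models with the highest accuracy is 1 and applying that indicator model as the reference model.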
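- A computer-readable storage medium, wherein the computer-readable storage medium stores a data mining based modeling system, and the data mining based modeling system, when executed by a processor, implements the following steps: S1, after receiving candidate indicators to be screened, dividing the candidate indicators evenly into K indicator groups; S2, calculating an intra-group distance D1 and an inter-group distance D2 for each candidate indicator in each indicator group, and calculating, from the intra-group distance D1 and the inter-group distance D2 and based on a predetermined calculation rule, a screening evaluation value A for each candidate indicator; S3, selecting candidate indicators according to the screening evaluation value A, and establishing an indicator model based on the value of K using the selected candidate indicators.
- The computer-readable storage medium according to claim 16, wherein step S2 comprises: S21, calculating the mean of the candidate indicators in each indicator group, obtaining a set of group centers from the means, and calculating the distance between each candidate indicator and the set of group centers, the calculated distance serving as the intra-group distance D1; S22, calculating the center distances between the indicator group containing each candidate indicator and every other indicator group, obtaining the indicator group with the smallest center distance, and calculating the inter-group distance D2 from the obtained indicator group; S23, calculating the screening evaluation value A: A=(1-D1)/(1-D2).
- The computer-readable storage medium according to claim 17, wherein step S3 comprises: S31, selecting, in each indicator group, at least one candidate indicator corresponding to the largest screening evaluation value and at least one candidate indicator corresponding to the smallest screening evaluation value; S32, if the value of K is greater than or equal to a preset threshold, establishing a predetermined indicator model using the candidate indicators selected from each indicator group; S33, if the value of K is less than the preset threshold, increasing the value of K, recalculating the screening evaluation values and performing step S31, so as to establish another predetermined indicator model using the candidate indicators selected from each indicator group.
- The computer-readable storage medium according to any one of claims 16 to 18, wherein the data mining based modeling system, when executed by the processor, further implements the following step: S4, validating the established indicator models with predetermined validation data samples, and applying the indicator model with the highest validated accuracy as the baseline model.
- The computer-readable storage medium according to claim 19, wherein step S4 comprises: if exactly one indicator model has the highest accuracy, applying that indicator model as the baseline model; if more than one indicator model has the highest accuracy, either randomly selecting one of them as the baseline model, or increasing the number of validation data samples until exactly one indicator model has the highest accuracy and applying that indicator model as the baseline model.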
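
Steps S1 to S3 as recited in the claims above can be sketched in Python. This is a minimal illustration under stated assumptions, not the applicant's implementation: each candidate indicator is assumed to be a row vector of sample values, distances are Euclidean, D2 is read as the distance from an indicator to the center of the nearest other group, and both distances are divided by (1 + largest distance) so that they fall in [0, 1). The claims specify none of these choices, and all function names are hypothetical.

```python
import numpy as np


def split_evenly(n_indicators, k):
    """S1: split indicator indices 0..n-1 into k groups of near-equal size."""
    if not 2 <= k <= n_indicators:
        raise ValueError("k must satisfy 2 <= k <= number of indicators")
    return np.array_split(np.arange(n_indicators), k)


def evaluation_values(X, groups):
    """S2: screening evaluation value A = (1 - D1) / (1 - D2) per indicator.

    X has shape (n_indicators, n_samples); each row is one candidate indicator.
    """
    # S21: group centers are the means of the indicators in each group.
    centers = np.stack([X[g].mean(axis=0) for g in groups])
    # S22: for each group, find the other group whose center is closest.
    center_dist = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=2)
    np.fill_diagonal(center_dist, np.inf)  # a group is not its own neighbor
    nearest = center_dist.argmin(axis=1)

    d1 = np.empty(len(X))
    d2 = np.empty(len(X))
    for gi, g in enumerate(groups):
        d1[g] = np.linalg.norm(X[g] - centers[gi], axis=1)           # own center
        d2[g] = np.linalg.norm(X[g] - centers[nearest[gi]], axis=1)  # nearest other center

    # Assumption: rescale so both distances lie in [0, 1) and 1 - D2 stays positive.
    scale = 1.0 + max(d1.max(), d2.max())
    return (1.0 - d1 / scale) / (1.0 - d2 / scale)  # S23


def select_indicators(a, groups):
    """S31: in each group keep the indicators with the largest and smallest A."""
    chosen = set()
    for g in groups:
        chosen.add(int(g[a[g].argmax()]))
        chosen.add(int(g[a[g].argmin()]))
    return sorted(chosen)


def selections_up_to_threshold(X, k, k_threshold):
    """S32/S33: one selection per K, growing K until it reaches the threshold.

    Assumption: increasing K means re-splitting the indicators into K groups.
    """
    results = {}
    for kk in range(k, max(k, k_threshold) + 1):
        groups = split_evenly(len(X), kk)
        results[kk] = select_indicators(evaluation_values(X, groups), groups)
    return results
```

Each entry of the returned dictionary is the indicator subset from which one indicator model would be built; comparing those models on validation samples (step S4) is left out of the sketch because the claims do not fix a model type.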
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201611263812.0 | 2016-12-30 | ||
CN201611263812.0A CN106874933A (zh) | 2016-12-30 | 2016-12-30 | Data mining based modeling method and device |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2018120726A1 (zh) | 2018-07-05 |
Family
ID=59164643
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2017/091374 WO2018120726A1 (zh) | 2016-12-30 | 2017-06-30 | Data mining based modeling method, system, electronic device, and storage medium |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN106874933A (zh) |
WO (1) | WO2018120726A1 (zh) |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
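CN106874933A (zh) * | 2016-12-30 | 2017-06-20 | 平安科技(深圳)有限公司 | Data mining based modeling method and device |
CN108647720A (zh) * | 2018-05-10 | 2018-10-12 | 上海扩博智能技术有限公司 | Iterative loop recognition method, system, device, and storage medium for commodity images |
CN110399262B (zh) * | 2019-06-17 | 2022-09-27 | 平安科技(深圳)有限公司 | Operation and maintenance monitoring alarm convergence method and apparatus, computer device, and storage medium |
CN113723831A (zh) * | 2021-09-02 | 2021-11-30 | 深圳前海微众银行股份有限公司 | Modeling complexity adjustment method, apparatus, device, and computer-readable storage medium |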
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
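CN101806396A (zh) * | 2010-04-24 | 2010-08-18 | 上海交通大学 | Method for generating a pressure distribution map of an urban water supply network |
CN103208039A (zh) * | 2012-01-13 | 2013-07-17 | 株式会社日立制作所 | Software project risk evaluation method and device |
CN103729550A (zh) * | 2013-12-18 | 2014-04-16 | 河海大学 | Multi-model integrated flood forecasting method based on propagation time cluster analysis |
CN103942604A (zh) * | 2013-01-18 | 2014-07-23 | 上海安迪泰信息技术有限公司 | Prediction method and system based on a forest discrimination model |
CN106874933A (zh) * | 2016-12-30 | 2017-06-20 | 平安科技(深圳)有限公司 | Data mining based modeling method and device |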
- 2016-12-30: CN CN201611263812.0A (patent/CN106874933A/zh), active, Pending
- 2017-06-30: WO PCT/CN2017/091374 (patent/WO2018120726A1/zh), active, Application Filing
Also Published As
Publication number | Publication date |
---|---|
CN106874933A (zh) | 2017-06-20 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 17888178; Country of ref document: EP; Kind code of ref document: A1 |
 | NENP | Non-entry into the national phase | Ref country code: DE |
 | 122 | Ep: pct application non-entry in european phase | Ref document number: 17888178; Country of ref document: EP; Kind code of ref document: A1 |
 | 32PN | Ep: public notification in the ep bulletin as address of the adressee cannot be established | Free format text: NOTING OF LOSS OF RIGHTS (EPO FORM 1205A DATED 04.10.2019) |
 | 122 | Ep: pct application non-entry in european phase | Ref document number: 17888178; Country of ref document: EP; Kind code of ref document: A1 |